arxiv: 2604.13822 · v1 · submitted 2026-04-15 · 💻 cs.LG

Recognition: unknown

UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

Zhengxi Lu , Fei Tang , Guangyi Liu , Kaitao Song , Xu Tan , Jin Ma , Wenqi Zhang , Weiming Lu

show 3 more authors

Jun Xiao Yueting Zhuang Yongliang Shen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:55 UTC · model grok-4.3

classification 💻 cs.LG

keywords GUI agentslong-horizon taskstool-integrated policy optimizationmemory decouplingmulti-modal language modelsAndroidWorldMemGUI-Benchtool use

0 comments

The pith

UI-Copilot lets a GUI agent call a lightweight copilot for memory retrieval and calculations only when needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome memory degradation, progress confusion, and numerical hallucinations that limit MLLM-based GUI agents in long sequences of interface actions. It does so by creating a collaborative setup in which the main agent executes the task while a smaller copilot supplies on-demand help, with memory split into persistent observations and transient context. A new training method, Tool-Integrated Policy Optimization, teaches the agent to decide when to invoke the copilot as either a retriever or a calculator. If the approach holds, agents can sustain coherent behavior over dozens of steps without the main model being forced to store or compute everything internally.

Core claim

UI-Copilot is a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. Memory decoupling separates persistent observations from transient execution context. The policy agent is trained to selectively invoke the copilot as Retriever or Calculator using Tool-Integrated Policy Optimization, which optimizes tool selection separately through single-turn prediction and task execution through on-policy multi-turn rollouts. UI-Copilot-7B reaches state-of-the-art results on MemGUI-Bench and records a 17.1 percent absolute gain on AndroidWorld over the base Qwen model.

What carries the argument

Tool-Integrated Policy Optimization (TIPO), which trains tool selection with single-turn prediction while training task execution with on-policy multi-turn rollouts.

If this is right

Outperforms other 7B-scale GUI agents on MemGUI-Bench.
Delivers a 17.1 percent absolute improvement on AndroidWorld over the base Qwen model.
Reduces memory degradation and numerical hallucinations during extended interaction sequences.
Maintains strong generalization when moving from benchmark tasks to real mobile applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-invocation pattern could be applied to web-browsing or desktop-automation agents that also suffer from context overload.
Keeping the main model focused on execution while off-loading retrieval and arithmetic may lower overall token usage across many agent runs.
The memory-decoupling idea could be tested in non-GUI domains such as multi-step planning or code generation where persistent state must be preserved.

Load-bearing premise

The learned policy for deciding when to call the copilot adds no new errors or latency that cancel out the gains from the assistance it provides.

What would settle it

A side-by-side deployment run that records whether average task completion time or failure rate rises when the policy and copilot calls are active compared with the base agent alone.

Figures

Figures reproduced from arXiv: 2604.13822 by Fei Tang, Guangyi Liu, Jin Ma, Jun Xiao, Kaitao Song, Weiming Lu, Wenqi Zhang, Xu Tan, Yongliang Shen, Yueting Zhuang, Zhengxi Lu.

**Figure 1.** Figure 1: Left: Performance on dynamic GUI benchmarks. Right: Task distribution of these benchmarks. solving short-horizon tasks, which typically require fewer than 10 interaction steps (Rawles et al., 2024; Zhao et al., 2025; Chen et al., 2026), as shown in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: MemGUI-Bench Inference Case. Our method successfully completes the task by invoking Copilot Model, whereas other models fail due to memory degradation, progress confusion, and math hallucinations. Existing approaches address these limitations through multi-agent workflows (Agashe et al., 2025; Wang et al., 2025b, 2024a) or retrieval augmentation (Liu et al., 2025a; Li et al., 2025b; Xu et al., 2025). Howe… view at source ↗

**Figure 3.** Figure 3: Overview of TIPO Pipeline. Policy model jointly learns tool invocations and multi-turn action prediction. loss and memory hallucination. 3) Efficient Inference. Unlike previous agent workflows (Wang et al., 2024a), the agent selectively invokes external models, simplifying the execution pipeline. 3.2 Dataset Curation We collect N diverse, human-annotated trajectories τ ∗ = {(S ∗ 1 , a∗ 1 ), . . . ,(S ∗ T… view at source ↗

**Figure 4.** Figure 4: Training Dataset Curation Pipeline. 3.3 Tool-Integrated Policy Optimization Our GUI agent M is efficiently trained on the pseudo-labeled D0, during which the copilot model Mc is not involved. Cold Start. We initialize the policy via SFT on Qwen2.5VL-7B using DSFT. Training is performed with a standard cross-entropy loss for next-token prediction, which enables both format learning and behavior cloning from… view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 7.** Figure 7: Ablations on Training Paradigms and Dataset. Tool calling and multi-turn performance are tested on Tool-call-Test (1000 tasks) and AC-Real (1536 tasks). On/Offpolicy depends on the history summary. best worst vocations consistently decreases during training, indicating the model’s improving ability to use external tools. Notably, compared to AndroidWorld (approximately 6% tool usage), more complex task… view at source ↗

**Figure 8.** Figure 8: Case Study. UI-Copilot-7B successfully completes a math-related task from AndroidWorld (top) and a memory-related task from MemGUI-Bench (bottom). 4.5 Case Study and Analysis Case Study [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: Error Type Analysis with Tool Usage. Errors are categorized into Memory Degradation, Progress Confusion, Math Hallucination and Other Fault. case, the Retriever gathers critical information at steps 5, 12, and 18, enabling UI-Copilot-7B to correctly complete the form-filling task at step 23. Additional successful and failed cases are provided in Appendix F, further demonstrating the effectiveness of our … view at source ↗

**Figure 10.** Figure 10: Interaction example for UI-Copilot-7B. B Reward Definition B.1 Type Reward (rtype) rtype ∈ {0, 1} indicates whether the predicted action type matches the ground-truth action type. Let a pred and a gt denote the predicted and ground-truth action types, respectively. The type reward is defined as rtype = I[a pred = a gt]. B.2 Accuracy Reward (racc) racc ∈ {0, 1} evaluates whether the predicted action is … view at source ↗

**Figure 11.** Figure 11: Action Type Distribution of DRL action [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 15.** Figure 15: Difficulty Level Distribution of AndroidWorld [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

**Figure 17.** Figure 17: Golden steps comparison between Android [PITH_FULL_IMAGE:figures/full_fig_p015_17.png] view at source ↗

**Figure 18.** Figure 18: Pass@k validation on AndroidWorldVerified (60 tasks) and MemGUI-Bench-Verified (55 tasks) [PITH_FULL_IMAGE:figures/full_fig_p016_18.png] view at source ↗

**Figure 19.** Figure 19: Tool Type Distribution on AndroidWorldVerified, MemGUI-Bench-Verified, and MiniWob++ [PITH_FULL_IMAGE:figures/full_fig_p017_19.png] view at source ↗

**Figure 20.** Figure 20: Error Type Analysis of Tool Invocation. F More Cases F.1 Successful Cases Vanilla Execution [PITH_FULL_IMAGE:figures/full_fig_p017_20.png] view at source ↗

**Figure 21.** Figure 21: Successful Cases: Vanilla Rollout without tool usage [PITH_FULL_IMAGE:figures/full_fig_p018_21.png] view at source ↗

**Figure 22.** Figure 22: Successful Cases: Tool-Integrated Rollout with Math Calculator [PITH_FULL_IMAGE:figures/full_fig_p018_22.png] view at source ↗

**Figure 23.** Figure 23: Successful Cases: Tool-Integrated Rollout with Memory Retriever [PITH_FULL_IMAGE:figures/full_fig_p019_23.png] view at source ↗

**Figure 24.** Figure 24: Bad Cases: Reasoning Hallucination [PITH_FULL_IMAGE:figures/full_fig_p020_24.png] view at source ↗

**Figure 25.** Figure 25: Bad Cases: Progress Confusion [PITH_FULL_IMAGE:figures/full_fig_p020_25.png] view at source ↗

**Figure 26.** Figure 26: Bad Cases: Action Inconsistency [PITH_FULL_IMAGE:figures/full_fig_p020_26.png] view at source ↗

**Figure 27.** Figure 27: Prompt for UI-Copilot-7B [PITH_FULL_IMAGE:figures/full_fig_p021_27.png] view at source ↗

**Figure 28.** Figure 28: Prompt for Retriever. Prompt for Calculator. You are a GUI assistant for numerical calculation. Given the task instruction and the interaction history, you need to write executable python code to support numerical calculation. ,→ ,→ The overall task instruction is: '{task}'. The history summary is as follows. # Interaction History step 1: '{step_1}' step 2: '{step_2}' Output the thinking process in `think… view at source ↗

**Figure 29.** Figure 29: Prompt for Calculator [PITH_FULL_IMAGE:figures/full_fig_p022_29.png] view at source ↗

read the original abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UI-Copilot splits GUI execution from a copilot for memory and calc via memory decoupling and two-stage TIPO, delivering benchmark gains whose source is still unclear.

read the letter

The paper's main contribution is a clean split: the primary GUI agent sticks to task execution while a lightweight copilot handles on-demand memory retrieval and numerical calculation. Memory decoupling keeps persistent observations separate from transient context, and TIPO trains tool selection in single-turn prediction before switching to on-policy multi-turn rollouts for the full trajectory. The 7B model beats other 7B-scale GUI agents on MemGUI-Bench and shows a 17-point absolute lift on AndroidWorld over the base Qwen model. This is a practical engineering pattern for long-horizon GUI work where agents commonly lose track or hallucinate numbers. The staged optimization avoids back-propagating through full trajectories just for tool decisions, which could make training more stable than joint approaches. The idea is straightforward and directly targets real pain points in current MLLM agents. The soft spots sit in the transfer assumption. Single-turn tool selection does not automatically convey the downstream cost of an unnecessary or wrong invocation across many steps, and the abstract supplies no ablations on invocation frequency, tool-error rates, or comparisons against always-call or never-call baselines. Without those controls it is hard to know whether the reported gains come from smarter tool use or simply from the extra training procedure. Latency and new error modes introduced by the copilot itself are also unaddressed. The citation pattern is standard for the subfield and does not hide prior tool-augmented agent work. This paper is for people building or evaluating deployable GUI agents for mobile and desktop automation. A reader looking for a concrete recipe to add selective external tools will find something worth trying, once the full experiments are checked. It deserves peer review because the architecture is explicit and the benchmarks matter for applications, even though the evaluation will need tightening on ablations and error analysis.

Referee Report

2 major / 2 minor

Summary. The paper introduces UI-Copilot, a collaborative framework for long-horizon GUI automation in which a primary GUI agent handles task execution while a lightweight copilot supplies on-demand assistance for memory retrieval (Retriever) and numerical computation (Calculator). It proposes memory decoupling to separate persistent observations from transient execution context and Tool-Integrated Policy Optimization (TIPO), which optimizes tool selection via single-turn prediction and task execution via on-policy multi-turn rollouts. Experiments claim that the resulting UI-Copilot-7B model achieves state-of-the-art results on MemGUI-Bench, outperforming other 7B-scale agents such as GUI-Owl-7B and UI-TARS-1.5-7B, and delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model.

Significance. If the performance gains are substantiated, the work offers a practical route to improving robustness of GUI agents on extended tasks by offloading memory and calculation burdens to specialized tools without enlarging the core model. The TIPO separation of concerns could generalize to other tool-augmented agent settings where single-turn decisions must support multi-turn reliability.

major comments (2)

[5] Section 5 (Experiments): The reported SOTA performance on MemGUI-Bench and the 17.1% absolute improvement on AndroidWorld are presented without any description of training data composition, evaluation protocols, number of runs, statistical significance tests, or ablation controls on the TIPO components. This absence prevents verification of the central empirical claims and their attribution to the proposed method.
[4.2] Section 4.2 (TIPO): The framework optimizes tool selection through single-turn prediction while optimizing execution through separate on-policy multi-turn rollouts, yet provides no explicit alignment, credit-assignment, or auxiliary loss that would allow the policy to internalize the downstream cost of erroneous invocations or to learn appropriate temporal context for when to call the copilot. No analysis demonstrates that single-turn accuracy transfers to full trajectories without compounding errors.

minor comments (2)

[Abstract] The abstract refers to a 'lightweight copilot' but does not indicate its parameter count or architecture relative to the 7B agent, which would help readers assess the claimed efficiency benefit.
[4] Notation for the two copilot modes (Retriever vs. Calculator) and the memory-decoupling variables should be introduced with explicit definitions in the method section rather than only in the experimental narrative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments raise valid points about empirical rigor and methodological transparency that we will address through targeted revisions.

read point-by-point responses

Referee: Section 5 (Experiments): The reported SOTA performance on MemGUI-Bench and the 17.1% absolute improvement on AndroidWorld are presented without any description of training data composition, evaluation protocols, number of runs, statistical significance tests, or ablation controls on the TIPO components. This absence prevents verification of the central empirical claims and their attribution to the proposed method.

Authors: We agree that the current manuscript lacks sufficient experimental details. In the revised version, Section 5 will be expanded to include: a full description of training data composition and sources; complete evaluation protocols with task breakdowns for both MemGUI-Bench and AndroidWorld; the number of runs with mean and standard deviation reporting; statistical significance tests; and dedicated ablation studies on the single-turn and multi-turn components of TIPO. These additions will directly support verification and attribution of the reported gains. revision: yes
Referee: Section 4.2 (TIPO): The framework optimizes tool selection through single-turn prediction while optimizing execution through separate on-policy multi-turn rollouts, yet provides no explicit alignment, credit-assignment, or auxiliary loss that would allow the policy to internalize the downstream cost of erroneous invocations or to learn appropriate temporal context for when to call the copilot. No analysis demonstrates that single-turn accuracy transfers to full trajectories without compounding errors.

Authors: TIPO separates the phases deliberately: single-turn optimization trains accurate, context-aware tool selection, while on-policy multi-turn rollouts expose the policy to full trajectory outcomes, allowing it to learn the downstream effects of tool calls through direct reward signals without needing separate credit-assignment machinery. We will add an analysis subsection in the revision that reports single-turn tool accuracy, its correlation with multi-turn success, and examples of error propagation along with how multi-turn optimization reduces compounding. This will clarify the transfer without introducing auxiliary losses. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method and benchmark results

full rationale

The paper proposes UI-Copilot with memory decoupling and TIPO (separate single-turn tool selection and multi-turn on-policy rollouts), then reports empirical gains on MemGUI-Bench and AndroidWorld. No equations, derivations, or claims reduce any result to fitted inputs or self-citations by construction. TIPO is a training procedure whose outputs are evaluated on held-out benchmarks; the separation of optimization stages is a stated design choice, not a self-definitional loop. No uniqueness theorems, ansatzes smuggled via self-citation, or renaming of known results appear. The work is self-contained against external benchmarks and therefore receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond standard assumptions of reinforcement learning and multimodal model fine-tuning.

pith-pipeline@v0.9.0 · 5557 in / 1103 out tokens · 30881 ms · 2026-05-10T13:55:51.757561+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 3 internal anchors

[1]

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

Ui-agile: Advancing gui agents with effective reinforcement learning and precise inference-time grounding.arXiv preprint arXiv:2507.22025. Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. 2024. Showui: One vision-language-action model for gui visual agent. Guangyi Liu, Pengxian...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments,

Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments.arXiv preprint arXiv:2602.06075. Guangyi Liu, Pengxiang Zhao, Liang Liu, Zhiming Chen, Yuxiang Chai, Shuai Ren, Hao Wang, Shibo He, and Wenchao Meng. 2025a. Learnact: Few- shot mobile gui agent with a unified demonstration benchmark.arXiv preprint arXiv:2504.13805. Guangyi Liu,...

work page arXiv 2024
[3]

Os-genesis: Automating gui agent trajectory construction via reverse task synthesis,

Os-genesis: Automating gui agent trajec- tory construction via reverse task synthesis.arXiv preprint arXiv:2412.19723. Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, and 1 others. 2025a. Gui-g 2: Gaussian reward modeling for gui grounding.arXiv preprint arXiv:2507.15846. Fei ...

work page arXiv 2026
[4]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Os-atlas: A foundation action model for gener- alist gui agents.arXiv preprint arXiv:2410.23218. Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C Ho, Carl Yang, and Dong Yu. 2025. Retrieval-augmented gui agents with generative guidelines. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing, pages 17877–17886. Yi...

work page internal anchor Pith review arXiv 2025
[5]

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Mas-bench: A unified benchmark for shortcut- augmented hybrid mobile gui agents.arXiv preprint arXiv:2509.06477. Xurui Zhou, Gongwei Chen, Yuquan Xie, Zaijing Li, Kaiwen Zhou, Shuai Wang, Shuo Yang, Zhuotao Tian, and Rui Shao. 2025. Hiconagent: History context-aware policy optimization for gui agents. arXiv preprint arXiv:2512.01763. A Action Space Action...

work page internal anchor Pith review Pith/arXiv arXiv 2025