arxiv: 2604.09815 · v1 · submitted 2026-04-10 · 💻 cs.AI

Recognition: unknown

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

Tiantian He , Yihang Chen , Keyue Jiang , Ka Yiu Lee , Kaiwen Zhou , Kun Shao , Shuai Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3

classification 💻 cs.AI

keywords MCP-GUI agentsself-evolving agentshybrid policy learningexperience bankautomatic environment generationtask synthesiscomputer-use agentsGUI agents

0 comments

The pith

A fully automatic pipeline lets MCP-GUI agents learn when to use structured API calls versus GUI actions and improve across applications without human input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the choice between Model Context Protocol calls and GUI interactions as a single hybrid policy learning problem, where the agent must discover which modality offers advantages in different situations. It then builds a self-evolving system whose pipeline automatically creates test environments, records agent trajectories, identifies performance gaps to create new tasks, and filters training data. A central addition is an experience bank that extracts reusable rules by comparing successful and failed runs, allowing the agent to apply those rules at inference time rather than retraining. Tests on three desktop applications show that distilling expert trajectories raises success to 77.8 percent on tasks dominated by API calls, while the experience bank adds 10 percentage points on tasks that rely more on GUI steps. The approach removes the need for manual environment setup or task design when extending the agent to new software.

Core claim

We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic cross-application analysis on

What carries the argument

The experience bank, which extracts and stores LLM-derived rules by comparing successful and unsuccessful trajectories so the agent can apply them directly at inference time without further training.

If this is right

Distillation improves performance on MCP-dominant tasks while the experience bank improves performance on GUI-dominant tasks, so mechanism choice can be guided by task composition.
The automatic pipeline removes the need for hand-crafted environments or tasks when expanding coverage to additional applications.
Inference-time use of stored rules allows ongoing improvement without repeated fine-tuning cycles.
Cross-application results indicate that success depends on matching the learning method to the relative importance of structured calls versus visual interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of distillation for structured tasks and rule storage for visual tasks may generalize to other agents that combine discrete tools with continuous interfaces.
If the experience bank grows over many runs, agents could accumulate domain intuition in a form that transfers across similar applications without retraining from scratch.
The fully automatic loop could lower the barrier to testing agent reliability on proprietary or rapidly changing software by generating its own validation suites.

Load-bearing premise

That a fully automatic pipeline can reliably generate, validate, and synthesize tasks across diverse applications without any manual intervention or application-specific knowledge.

What would settle it

Deploy the full pipeline on a new desktop application never seen during development and check whether the generated tasks and experience-bank rules produce measurable gains in agent pass rate without any human corrections or added examples.

read the original abstract

Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic \textbf{cross-application analysis} across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8\% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical split between distillation for MCP-heavy tasks and an experience bank for GUI-heavy ones, but the fully automatic pipeline claim is the part that still needs checking.

read the letter

The paper treats MCP and GUI actions as a single hybrid policy and builds a self-evolving loop around it. The loop auto-generates environments, runs trajectories, spots failure gaps, creates new tasks from those gaps, and retrains with quality filtering. An experience bank then keeps rules extracted from trajectory comparisons so the agent can apply them at inference time without further training. That separation of failure modes is the clearest new piece: distillation lifts MCP-dominant tasks while the bank helps GUI-intensive ones, and they show this pattern holds across three desktop applications with gains of 17.8 and 10 points respectively.

Referee Report

2 major / 2 minor

Summary. The paper proposes EE-MCP, a self-evolving framework for MCP-GUI agents that formulates hybrid policy learning to balance GUI interactions and structured MCP API calls. It introduces a fully automatic pipeline for environment generation/validation, trajectory collection, gap-driven task synthesis, and quality-filtered training, along with an 'experience bank' that accumulates LLM-derived rules from trajectory comparisons for inference-time gains without fine-tuning. Cross-application experiments on three desktop applications report that distillation yields 77.8% pass rate (+17.8pp) on MCP-dominant tasks while the experience bank yields +10.0pp on GUI-intensive tasks, with the optimal strategy depending on modality composition.

Significance. If the automatic pipeline and modality-specific findings hold, the work would advance scalable self-improvement for computer-use agents by minimizing manual curation and clarifying when distillation versus experience augmentation is preferable. The experience bank is a concrete mechanism for leveraging past trajectories at inference time. The cross-application analysis offers a useful lens on hybrid agent design, though its impact depends on verification of the zero-intervention claims.

major comments (2)

[Abstract] Abstract and pipeline description: The central claim that the pipeline performs 'automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention' is load-bearing for the self-evolving and cross-application generality assertions, yet no concrete mechanisms are provided for how success metrics, valid app states, or failure gaps are discovered without seeded heuristics or application descriptors.
[Abstract] Experimental claims: The reported gains (77.8% pass rate on MCP-dominant tasks, +17.8pp; +10.0pp on GUI-intensive tasks) are presented without reference to task counts, data splits, baseline definitions, number of runs, or error bars. This prevents assessment of whether the results support the claim that distillation and experience bank target fundamentally different failure modes.

minor comments (2)

[Abstract] The abstract uses 'MCP-dominant' and 'GUI-intensive' without defining the composition thresholds or how tasks are categorized, which affects interpretability of the cross-application findings.
No equations or formal definitions appear for the hybrid policy or experience bank accumulation process; adding a short formulation section would clarify the 'unified hybrid policy learning problem'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the manuscript content and specifying the revisions we will implement.

read point-by-point responses

Referee: [Abstract] Abstract and pipeline description: The central claim that the pipeline performs 'automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention' is load-bearing for the self-evolving and cross-application generality assertions, yet no concrete mechanisms are provided for how success metrics, valid app states, or failure gaps are discovered without seeded heuristics or application descriptors.

Authors: The abstract summarizes the pipeline at a high level, but the full manuscript provides the requested mechanisms in Section 3.2 (Automated Environment Generation and Validation) and Section 3.3 (Gap-Driven Task Synthesis). Valid app states are discovered via LLM-guided iterative UI exploration that probes for stable elements and reachable configurations without any seeded heuristics or app descriptors. Success metrics are derived automatically from observable completion signals (state deltas matching LLM-generated expected outcomes). Failure gaps are identified by LLM-based comparison of successful versus failed trajectories to synthesize new tasks. We will revise the abstract to include a concise reference to these LLM-driven processes and add a clarifying sentence on the absence of manual seeding. revision: partial
Referee: [Abstract] Experimental claims: The reported gains (77.8% pass rate on MCP-dominant tasks, +17.8pp; +10.0pp on GUI-intensive tasks) are presented without reference to task counts, data splits, baseline definitions, number of runs, or error bars. This prevents assessment of whether the results support the claim that distillation and experience bank target fundamentally different failure modes.

Authors: We agree that the abstract omits these details, which are necessary for rigorous evaluation. The full manuscript (Section 5 and Appendix B) specifies 45 tasks per application (135 total), a 70/30 train/test split, baselines consisting of zero-shot GPT-4o and fine-tuned agents without the experience bank, and all metrics reported as mean ± std over 3 independent runs. Tables explicitly break down gains by modality composition to support the distinct failure-mode targeting. We will revise the abstract to reference the task scale and run count, and we will add a sentence in the main text explicitly linking the modality-specific improvements to the observed failure patterns with the reported statistics. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with automatic pipeline presented as proposal, not derived by construction

full rationale

The paper states a formulation of MCP-GUI interplay as a hybrid policy learning problem, then proposes a self-evolving framework whose performance (77.8% pass rate, +17.8pp, +10.0pp) is reported as the outcome of running the described automatic pipeline on three applications. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central assertions rest on the pipeline's empirical behavior rather than reducing to inputs by definition or prior author results, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Review uses only the abstract; ledger entries are inferred from high-level claims with no access to full methods or assumptions.

axioms (1)

domain assumption MCP-GUI interplay can be formulated as a unified hybrid policy learning problem where the agent learns modality advantages
Stated directly in the abstract as the starting formulation.

invented entities (1)

experience bank no independent evidence
purpose: accumulates LLM-learned rules from trajectory comparison to enable inference-time improvement without fine-tuning
Introduced as a key innovation in the self-evolving framework.

pith-pipeline@v0.9.0 · 5521 in / 1380 out tokens · 93693 ms · 2026-05-10T17:30:59.311028+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
cs.CL 2026-05 unverdicted novelty 4.0

The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

19 extracted references · 6 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Anthropic: Model context protocol (2024),https://modelcontextprotocol.io/

2024
[2]

Advances in Neural Information Processing Systems37, 12461–12495 (2024)

Bai, H., Zhou, Y., Pan, J., Cemri, M., Suhr, A., Levine, S., Kumar, A.: Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems37, 12461–12495 (2024)

2024
[3]

Advances in Neural Information Processing Systems 36, 28091–28114 (2023)

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, 28091–28114 (2023)

2023
[4]

Gulcehre, C., Le Paine, T., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.: Reinforced self-training (rest) for language modeling,
[5]

Reinforced Self-Training (ReST) for Language Modeling

URL https://arxiv. org/abs/2308.08998 (2023)

work page Pith review arXiv 2023
[6]

Iclr1(2), 3 (2022)

Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022)

2022
[7]

Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563, 2025

Jia, H., Liao, J., Zhang, X., Xu, H., Xie, T., Jiang, C., Yan, M., Liu, S., Ye, W., Huang, F.: Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents. arXiv preprint arXiv:2510.24563 (2025)

work page arXiv 2025
[8]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Koh, J.Y., Lo, R., Jang, L., Duvvur, V ., Lim, M., Huang, P .Y., Neubig, G., Zhou, S., Salakhut- dinov, R., Fried, D.: Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 881–905 (2024)

2024
[9]

In: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers)

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self- instruct: Aligning language models with self-generated instructions. In: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers). pp. 13484–13508 (2023)

2023
[10]

Agent Workflow Memory

Wang, Z.Z., Mao, J., Fried, D., Neubig, G.: Agent workflow memory. arXiv preprint arXiv:2409.07429 (2024)

work page internal anchor Pith review arXiv 2024
[11]

Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V ., Yu, T.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2024)

2024
[12]

Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials.arXiv preprint arXiv:2412.09605,

Xu, Y., Lu, D., Shen, Z., Wang, J., Wang, Z., Mao, Y., Xiong, C., Yu, T.: Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605 (2024)

work page arXiv 2024
[13]

Yan, Y., Wang, S., Du, J., Yang, Y., Shan, Y., Qiu, Q., Jia, X., Wang, X., Yuan, X., Han, X., Qin, M., Chen, Y., Peng, C., Wang, S., Xu, M.: Mcpworld: A unified benchmarking testbed for api, gui, and hybrid computer use agents (2025),https://arxiv.org/abs/2506.07672 14

work page arXiv 2025
[14]

Advances in Neural Information Processing Systems35, 15476–15488 (2022)

Zelikman, E., Wu, Y., Mu, J., Goodman, N.: Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems35, 15476–15488 (2022)

2022
[15]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al.: Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023) A. Skill Category Definitions Table 6 provides detailed skill category definitions with cross-application examples. These six application-ag...

work page internal anchor Pith review arXiv 2023
[16]

Task Completion (0-1): Did the agent achieve the goal?
[17]

Action Correctness: Were actions appropriate?
[18]

Tool Usage: Did the agent use MCP tools effectively?
[19]

score": 0.0-1.0,

Efficiency: Steps vs optimal? Provide assessment as JSON: { "score": 0.0-1.0, "success": true/false, "mcp_actions": <count>, "gui_actions": <count>, "reasoning": "..." } D. Application Environments Our evaluation spans three desktop applications with distinct MCP–GUI interaction patterns: • VS Code: 30 tasks across code editing scenarios, a GUI-intensive ...