Recognition: unknown
EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
A fully automatic pipeline lets MCP-GUI agents learn when to use structured API calls versus GUI actions and improve across applications without human input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic cross-application analysis on
What carries the argument
The experience bank, which extracts and stores LLM-derived rules by comparing successful and unsuccessful trajectories so the agent can apply them directly at inference time without further training.
If this is right
- Distillation improves performance on MCP-dominant tasks while the experience bank improves performance on GUI-dominant tasks, so mechanism choice can be guided by task composition.
- The automatic pipeline removes the need for hand-crafted environments or tasks when expanding coverage to additional applications.
- Inference-time use of stored rules allows ongoing improvement without repeated fine-tuning cycles.
- Cross-application results indicate that success depends on matching the learning method to the relative importance of structured calls versus visual interactions.
Where Pith is reading between the lines
- The separation of distillation for structured tasks and rule storage for visual tasks may generalize to other agents that combine discrete tools with continuous interfaces.
- If the experience bank grows over many runs, agents could accumulate domain intuition in a form that transfers across similar applications without retraining from scratch.
- The fully automatic loop could lower the barrier to testing agent reliability on proprietary or rapidly changing software by generating its own validation suites.
Load-bearing premise
That a fully automatic pipeline can reliably generate, validate, and synthesize tasks across diverse applications without any manual intervention or application-specific knowledge.
What would settle it
Deploy the full pipeline on a new desktop application never seen during development and check whether the generated tasks and experience-bank rules produce measurable gains in agent pass rate without any human corrections or added examples.
read the original abstract
Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should balance these two modalities and how to enable iterative self-improvement across diverse applications. We formulate MCP-GUI interplay as a unified hybrid policy learning problem where the agent learns when each modality provides complementary advantages, and show that distillation and experience augmentation target fundamentally different failure modes - requiring application-aware mechanism selection. Built on this formulation, we propose a self-evolving framework with a fully automatic pipeline that orchestrates automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention. A key innovation is our experience bank, which accumulates LLM-learned rules from trajectory comparison, enabling inference-time improvement without fine-tuning. Systematic \textbf{cross-application analysis} across three desktop applications reveals that the optimal strategy depends on MCP-GUI composition: distillation achieves 77.8\% pass rate on MCP-dominant tasks (+17.8pp), while the experience bank excels on GUI-intensive tasks (+10.0pp).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EE-MCP, a self-evolving framework for MCP-GUI agents that formulates hybrid policy learning to balance GUI interactions and structured MCP API calls. It introduces a fully automatic pipeline for environment generation/validation, trajectory collection, gap-driven task synthesis, and quality-filtered training, along with an 'experience bank' that accumulates LLM-derived rules from trajectory comparisons for inference-time gains without fine-tuning. Cross-application experiments on three desktop applications report that distillation yields 77.8% pass rate (+17.8pp) on MCP-dominant tasks while the experience bank yields +10.0pp on GUI-intensive tasks, with the optimal strategy depending on modality composition.
Significance. If the automatic pipeline and modality-specific findings hold, the work would advance scalable self-improvement for computer-use agents by minimizing manual curation and clarifying when distillation versus experience augmentation is preferable. The experience bank is a concrete mechanism for leveraging past trajectories at inference time. The cross-application analysis offers a useful lens on hybrid agent design, though its impact depends on verification of the zero-intervention claims.
major comments (2)
- [Abstract] Abstract and pipeline description: The central claim that the pipeline performs 'automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention' is load-bearing for the self-evolving and cross-application generality assertions, yet no concrete mechanisms are provided for how success metrics, valid app states, or failure gaps are discovered without seeded heuristics or application descriptors.
- [Abstract] Experimental claims: The reported gains (77.8% pass rate on MCP-dominant tasks, +17.8pp; +10.0pp on GUI-intensive tasks) are presented without reference to task counts, data splits, baseline definitions, number of runs, or error bars. This prevents assessment of whether the results support the claim that distillation and experience bank target fundamentally different failure modes.
minor comments (2)
- [Abstract] The abstract uses 'MCP-dominant' and 'GUI-intensive' without defining the composition thresholds or how tasks are categorized, which affects interpretability of the cross-application findings.
- No equations or formal definitions appear for the hybrid policy or experience bank accumulation process; adding a short formulation section would clarify the 'unified hybrid policy learning problem'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, clarifying the manuscript content and specifying the revisions we will implement.
read point-by-point responses
-
Referee: [Abstract] Abstract and pipeline description: The central claim that the pipeline performs 'automatic environment generation and validation, trajectory collection, gap-driven task synthesis, and quality-filtered training - all without manual intervention' is load-bearing for the self-evolving and cross-application generality assertions, yet no concrete mechanisms are provided for how success metrics, valid app states, or failure gaps are discovered without seeded heuristics or application descriptors.
Authors: The abstract summarizes the pipeline at a high level, but the full manuscript provides the requested mechanisms in Section 3.2 (Automated Environment Generation and Validation) and Section 3.3 (Gap-Driven Task Synthesis). Valid app states are discovered via LLM-guided iterative UI exploration that probes for stable elements and reachable configurations without any seeded heuristics or app descriptors. Success metrics are derived automatically from observable completion signals (state deltas matching LLM-generated expected outcomes). Failure gaps are identified by LLM-based comparison of successful versus failed trajectories to synthesize new tasks. We will revise the abstract to include a concise reference to these LLM-driven processes and add a clarifying sentence on the absence of manual seeding. revision: partial
-
Referee: [Abstract] Experimental claims: The reported gains (77.8% pass rate on MCP-dominant tasks, +17.8pp; +10.0pp on GUI-intensive tasks) are presented without reference to task counts, data splits, baseline definitions, number of runs, or error bars. This prevents assessment of whether the results support the claim that distillation and experience bank target fundamentally different failure modes.
Authors: We agree that the abstract omits these details, which are necessary for rigorous evaluation. The full manuscript (Section 5 and Appendix B) specifies 45 tasks per application (135 total), a 70/30 train/test split, baselines consisting of zero-shot GPT-4o and fine-tuned agents without the experience bank, and all metrics reported as mean ± std over 3 independent runs. Tables explicitly break down gains by modality composition to support the distinct failure-mode targeting. We will revise the abstract to reference the task scale and run count, and we will add a sentence in the main text explicitly linking the modality-specific improvements to the observed failure patterns with the reported statistics. revision: yes
Circularity Check
No circularity: empirical framework with automatic pipeline presented as proposal, not derived by construction
full rationale
The paper states a formulation of MCP-GUI interplay as a hybrid policy learning problem, then proposes a self-evolving framework whose performance (77.8% pass rate, +17.8pp, +10.0pp) is reported as the outcome of running the described automatic pipeline on three applications. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described claims. The central assertions rest on the pipeline's empirical behavior rather than reducing to inputs by definition or prior author results, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption MCP-GUI interplay can be formulated as a unified hybrid policy learning problem where the agent learns modality advantages
invented entities (1)
-
experience bank
no independent evidence
Forward citations
Cited by 1 Pith paper
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
Reference graph
Works this paper leans on
-
[1]
Anthropic: Model context protocol (2024),https://modelcontextprotocol.io/
2024
-
[2]
Advances in Neural Information Processing Systems37, 12461–12495 (2024)
Bai, H., Zhou, Y., Pan, J., Cemri, M., Suhr, A., Levine, S., Kumar, A.: Digirl: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems37, 12461–12495 (2024)
2024
-
[3]
Advances in Neural Information Processing Systems 36, 28091–28114 (2023)
Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, 28091–28114 (2023)
2023
-
[4]
Gulcehre, C., Le Paine, T., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.: Reinforced self-training (rest) for language modeling,
-
[5]
Reinforced Self-Training (ReST) for Language Modeling
URL https://arxiv. org/abs/2308.08998 (2023)
work page Pith review arXiv 2023
-
[6]
Iclr1(2), 3 (2022)
Hu, E.J., Shen, Y., Wallis, P ., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022)
2022
-
[7]
Jia, H., Liao, J., Zhang, X., Xu, H., Xie, T., Jiang, C., Yan, M., Liu, S., Ye, W., Huang, F.: Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents. arXiv preprint arXiv:2510.24563 (2025)
-
[8]
In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Koh, J.Y., Lo, R., Jang, L., Duvvur, V ., Lim, M., Huang, P .Y., Neubig, G., Zhou, S., Salakhut- dinov, R., Fried, D.: Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 881–905 (2024)
2024
-
[9]
In: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers)
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self- instruct: Aligning language models with self-generated instructions. In: Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers). pp. 13484–13508 (2023)
2023
-
[10]
Wang, Z.Z., Mao, J., Fried, D., Neubig, G.: Agent workflow memory. arXiv preprint arXiv:2409.07429 (2024)
work page internal anchor Pith review arXiv 2024
-
[11]
Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V ., Yu, T.: Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2024)
2024
-
[12]
Xu, Y., Lu, D., Shen, Z., Wang, J., Wang, Z., Mao, Y., Xiong, C., Yu, T.: Agenttrek: Agent trajectory synthesis via guiding replay with web tutorials. arXiv preprint arXiv:2412.09605 (2024)
- [13]
-
[14]
Advances in Neural Information Processing Systems35, 15476–15488 (2022)
Zelikman, E., Wu, Y., Mu, J., Goodman, N.: Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems35, 15476–15488 (2022)
2022
-
[15]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al.: Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023) A. Skill Category Definitions Table 6 provides detailed skill category definitions with cross-application examples. These six application-ag...
work page internal anchor Pith review arXiv 2023
-
[16]
Task Completion (0-1): Did the agent achieve the goal?
-
[17]
Action Correctness: Were actions appropriate?
-
[18]
Tool Usage: Did the agent use MCP tools effectively?
-
[19]
score": 0.0-1.0,
Efficiency: Steps vs optimal? Provide assessment as JSON: { "score": 0.0-1.0, "success": true/false, "mcp_actions": <count>, "gui_actions": <count>, "reasoning": "..." } D. Application Environments Our evaluation spans three desktop applications with distinct MCP–GUI interaction patterns: • VS Code: 30 tasks across code editing scenarios, a GUI-intensive ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.