
arxiv: 2606.20622 · v1 · pith:CAFUFCCAnew · submitted 2026-05-26 · 💻 cs.AI · cs.LG

Darwin Mobile Agent: A Roadmap for Self-Evolution

Pith reviewed 2026-06-29 16:42 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords: mobile GUI agents · self-evolving agents · reinforcement learning · autonomous agents · human priors removal · cloud phone infrastructure · policy optimization

The pith

The most effective path to general adaptive agents is to remove human priors and let intelligence emerge from interaction with complex mobile GUI environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the principle that computation applied to complex data outperforms hand-crafted designs. On that basis it claims that the route to capable agents is to strip away human-specified tasks and verification, so that behavior can arise from direct engagement with a world far richer than the agent. Mobile graphical user interfaces serve as a concrete, accessible stand-in for that richer world. The authors supply an open-source infrastructure that runs many agent sessions in parallel on cloud phones, relieving the data-collection bottleneck that otherwise blocks large-scale reinforcement learning. They also map out a sequence for removing the remaining human elements from how tasks are generated, how success is judged, and how past experience is stored.

Core claim

Darwin Mobile Agent supplies both the cloud-phone infrastructure and the conceptual roadmap needed to move from supervised mobile control toward fully autonomous reinforcement learning, by treating the mobile GUI as a Big World proxy and by outlining the staged removal of human priors from task curricula, outcome verification, and memory management.

What carries the argument

The asynchronous agent-environment loop running across many parallel cloud-phone instances, which removes the data-collection bottleneck and supports stable policy optimization as the initial stage of self-evolution.
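
The asynchronous loop described above can be illustrated with a minimal sketch. It is not the Darwin implementation: `phone_step`, `rollout_worker`, `aggregate`, and `run` are hypothetical names, the cloud phone is simulated with a short random sleep, and the policy is a placeholder where a real system would call model inference. The sketch only shows the general pattern of slow parallel environments feeding completed trajectories into a shared queue that a consumer batches for training.

```python
import asyncio
import random


async def phone_step(action: int, rng: random.Random):
    """Simulated cloud-phone step (hypothetical stand-in, not a real device API)."""
    await asyncio.sleep(rng.uniform(0.001, 0.005))  # pretend device latency
    observation = rng.randrange(100)
    done = rng.random() < 0.1  # episode ends at random in this toy environment
    return observation, done


async def rollout_worker(phone_id, queue, num_rollouts, max_steps, seed):
    """Collect rollouts on one simulated phone and push each finished one to the queue."""
    rng = random.Random(seed)
    for _ in range(num_rollouts):
        trajectory = []
        obs = 0
        for _ in range(max_steps):
            action = obs % 4  # placeholder policy; a real agent would query the model here
            obs, done = await phone_step(action, rng)
            trajectory.append((action, obs))
            if done:
                break
        await queue.put((phone_id, trajectory))


async def aggregate(queue, batch_size, total):
    """Group completed trajectories into batches in the order they arrive."""
    batches, batch = [], []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:  # keep a final partial batch, if any
        batches.append(batch)
    return batches


async def run(num_phones=8, rollouts_per_phone=2, max_steps=32, batch_size=4):
    queue = asyncio.Queue()
    workers = [
        asyncio.create_task(
            rollout_worker(i, queue, rollouts_per_phone, max_steps, seed=i)
        )
        for i in range(num_phones)
    ]
    batches = await aggregate(queue, batch_size, num_phones * rollouts_per_phone)
    await asyncio.gather(*workers)
    return batches


if __name__ == "__main__":
    for i, batch in enumerate(asyncio.run(run())):
        print(i, [(phone_id, len(traj)) for phone_id, traj in batch])
```

Because workers never wait for one another, a slow phone delays only its own trajectories. The cost is that a batch may mix rollouts gathered under slightly different policy versions, which is the off-policy data that the Figure 3 caption mentions.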

If this is right

  • Policy optimization becomes feasible at scale inside real mobile interfaces without manual task engineering.
  • Agents can progress through successive stages that eliminate human input from curricula, verification, and memory.
  • The same parallel-cloud setup can support the later stages of the roadmap once the first stage succeeds.
  • The resulting agents would exhibit adaptive behavior across previously unseen mobile applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same removal-of-priors logic could be applied to other rich interfaces such as web browsers or robotic sensors.
  • Success would imply that large-scale unsupervised interaction data can substitute for curated training sets in many domains.
  • A direct test would track whether agents begin inventing their own sub-goals that were never supplied by the initial curriculum.

Load-bearing premise

The mobile GUI domain contains enough openness and complexity that intelligence will appear once human-designed tasks, checks, and memory rules are taken away.

What would settle it

Training agents for extended periods inside the Darwin framework and observing that they acquire no new capabilities on apps or tasks never shown during training would show the domain lacks sufficient richness.

Figures

Figures reproduced from arXiv: 2606.20622 by Daniel Beechey, Derek Yuen, Dezhao Luo, Jianheng Liu, Jun Wang, Kun Shao, Tiantian He, Weilin Luo.

Figure 1: Overview of the Darwin Mobile Agent Framework. The system employs an asynchronous Rollout Aggregator to bridge the gap between slow, parallelised mobile environments and high-throughput agent inference. Completed trajectories are passed to a verification module to generate rewards for continuous policy optimisation.
Figure 2: Mean Success Rate on SPA-Bench. The plot illustrates the average performance across eight tasks. Each training step represents 256 environment steps (8 phones × 32 steps per rollout), matching the maximum task horizon. The first 30 steps constitute a critic warm-up phase where the policy parameters are fixed. (a) Mean success rate vs. training steps. (b) Training steps vs. wall-clock time.
Figure 3: Distributed Scalability Evaluation. (a) Scaling from 8 to 32 phones shows consistent convergence despite increased off-policy data. (b) Throughput improves from 8 to 16 phones but saturates at 32, as the system bottleneck shifts to model inference and batching overhead.
Figure 4: Horizontal Scalability across Task Sets. The plot compares the mean success rate when training on 8 tasks (16 phones) versus 16 tasks (32 phones). Despite the increased task diversity, both configurations exhibit similar convergence rates.
Figure 5: Task Lifecycle Stability. The panels illustrate the mean success rate across the three protocol phases. The “cycle” protocol successfully manages automated transitions; the agent exhibits a learning trend during the setup and teardown phases that mirrors the progression of the primary task. (a) Initialisation and Warm-up. (b) Truncation Handling.
Figure 6: Critic Robustness Evaluation. (a) Calibrated initialisation prevents policy collapse; stability is further supported by combining this initialisation with a warm-up phase. (b) Treating truncated trajectories as terminal by assuming the next state-value is zero (V(s′) = 0) outperforms bootstrapping from a learned critic.
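
The truncation choice named in the Figure 6 caption can be made concrete with a small sketch. It assumes a generalized-advantage-estimation (GAE) style estimator, which the paper's reference list includes but whose exact use in Darwin is not confirmed here. The function `gae_advantages` and its parameters are hypothetical. The only thing it demonstrates is how the tail value V(s′) enters the advantage calculation when a rollout is cut off at the step limit.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95,
                   truncated=False, bootstrap_on_truncation=False):
    """GAE over one rollout (illustrative sketch, not the Darwin code).

    rewards[t] and values[t] = V(s_t) cover steps 0..T-1.
    last_value is the critic's estimate for the state after the final step.
    """
    # Tail value V(s'): the critic's estimate is used only when the rollout was
    # truncated AND bootstrapping is requested; otherwise the end is treated
    # as terminal and V(s') = 0.
    next_value = last_value if (truncated and bootstrap_on_truncation) else 0.0

    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running              # discounted sum of TD errors
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    rewards = [0.0, 0.0, 0.0]   # no reward seen before the step limit
    values = [0.5, 0.5, 0.5]    # critic estimates along the rollout
    as_terminal = gae_advantages(rewards, values, last_value=0.8,
                                 gamma=1.0, lam=1.0, truncated=True)
    bootstrapped = gae_advantages(rewards, values, last_value=0.8,
                                  gamma=1.0, lam=1.0, truncated=True,
                                  bootstrap_on_truncation=True)
    print(as_terminal, bootstrapped)
```

With the tail set to zero, an unfinished rollout receives negative advantages whenever the critic had expected reward. With bootstrapping, the sign depends entirely on the critic's estimate for the cut-off state, so errors in a poorly trained critic feed straight into the policy update. That is one plausible reading of why the caption reports the zero-value treatment as more robust.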
Original abstract

The goal of artificial intelligence is to create agents capable of general, adaptive behaviour in open-ended environments. Guided by the "Bitter Lesson", we argue that the most effective path toward this goal is to systematically remove human priors and allow intelligence to naturally emerge through interaction with a "Big World" that is orders of magnitude more complex than the agent itself. We propose the mobile Graphical User Interface (GUI) as a practical proxy for such a world and introduce Darwin Mobile Agent, an open-source infrastructure designed as a foundation for autonomous reinforcement learning in this domain. This framework addresses the data-collection bottleneck in real-world mobile interactions by using an asynchronous agent-environment loop across parallel cloud-phone instances. We further propose a conceptual roadmap to systematically remove human priors from three fundamental pillars of a self-evolving agent: task curricula, outcome verification, and memory management. We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap: policy optimisation in the GUI domain. This work establishes the practical and theoretical foundation necessary to move toward truly autonomous, self-evolving GUI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Darwin Mobile Agent, an open-source infrastructure using asynchronous agent-environment loops across parallel cloud-phone instances to address data-collection bottlenecks for reinforcement learning in the mobile GUI domain. It presents this domain as a practical proxy for a 'Big World' and outlines a conceptual roadmap for systematically removing human priors from task curricula, outcome verification, and memory management to enable emergent intelligence. The paper asserts that the infrastructure has been validated for stability and scalability sufficient to support the first stage of policy optimisation in this domain.

Significance. If the infrastructure proves stable and scalable and the mobile-GUI proxy supplies adequate complexity once priors are removed, the work could provide a practical foundation for studying autonomous, self-evolving agents and reduce reliance on human-designed curricula and verifiers. The explicit roadmap for prior removal across three pillars is a clear organising contribution. The significance remains prospective because the manuscript is a proposal and infrastructure description rather than an empirical demonstration of emergence or prior removal.

major comments (2)
  1. [Abstract] Abstract: the claim that 'We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap' is load-bearing for positioning the infrastructure as the foundation for the proposed self-evolution program, yet the manuscript supplies no experimental setup, metrics, baselines, or results to support this validation.
  2. [Roadmap] Roadmap section (conceptual pillars): the proposal to remove human priors from outcome verification and memory management is presented without a concrete mechanism or falsifiable test for how verification and memory will be replaced by emergent processes once the infrastructure is in place, leaving the central hypothesis ungrounded beyond the initial policy-optimisation stage.
minor comments (1)
  1. The term 'Big World' is used repeatedly without a precise operational definition or reference to prior literature that would allow readers to assess the claimed orders-of-magnitude complexity difference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of the infrastructure and the organizing value of the roadmap. Our manuscript is positioned as an infrastructure proposal together with a high-level conceptual roadmap; we address each major comment below.

  1. Referee: [Abstract] Abstract: the claim that 'We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap' is load-bearing for positioning the infrastructure as the foundation for the proposed self-evolution program, yet the manuscript supplies no experimental setup, metrics, baselines, or results to support this validation.

    Authors: We agree that the validation claim requires substantiation. The current text asserts stability on the basis of the asynchronous design and initial cloud-phone deployments but contains no formal experimental protocol, metrics, baselines, or quantitative results. We will revise the abstract to remove or qualify the claim and will add a short section describing any available preliminary stability observations or explicitly note the absence of formal validation. Revision: yes.

  2. Referee: [Roadmap] Roadmap section (conceptual pillars): the proposal to remove human priors from outcome verification and memory management is presented without a concrete mechanism or falsifiable test for how verification and memory will be replaced by emergent processes once the infrastructure is in place, leaving the central hypothesis ungrounded beyond the initial policy-optimisation stage.

    Authors: The roadmap is deliberately conceptual and identifies the three pillars as directions for subsequent research once the infrastructure enables the first stage of policy optimization. No concrete mechanisms or falsifiable tests are supplied because these steps lie beyond the scope of the present proposal. We will revise the text to state explicitly that the verification and memory pillars are high-level objectives without current implementations. The grounding for the initial stage rests on the infrastructure's capacity to support RL without human-designed task curricula; the later pillars remain prospective. Revision: partial.

Circularity Check

0 steps flagged

No significant circularity


The manuscript is a conceptual proposal and infrastructure roadmap rather than a derivation or empirical study containing fitted parameters, predictions, or mathematical claims. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the central hypothesis (removing human priors to enable emergence in a mobile-GUI proxy) is presented as a guiding principle supported by a basic stability validation, with no internal reduction to its own inputs. The work is self-contained as an engineering and conceptual foundation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that scaling interaction with a complex proxy environment will produce general intelligence once human priors are removed; no free parameters or new invented physical entities are introduced.

axioms (1)
  • Domain assumption. The Bitter Lesson: removing human priors and scaling compute yields superior AI performance
    Explicitly invoked in the abstract as the guiding principle for the proposed approach.
invented entities (1)
  • Darwin Mobile Agent infrastructure (no independent evidence)
    purpose: Asynchronous parallel cloud-phone platform for GUI reinforcement learning
    Newly proposed system whose independent evidence consists only of the authors' stability claim.

pith-pipeline@v0.9.1-grok · 5734 in / 1259 out tokens · 48658 ms · 2026-06-29T16:42:06.330826+00:00 · methodology

