Darwin Mobile Agent: A Roadmap for Self-Evolution

Daniel Beechey; Derek Yuen; Dezhao Luo; Jianheng Liu; Jun Wang; Kun Shao; Tiantian He; Weilin Luo

arxiv: 2606.20622 · v1 · pith:CAFUFCCAnew · submitted 2026-05-26 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-29 16:42 UTC · model grok-4.3

keywords mobile GUI agents · self-evolving agents · reinforcement learning · autonomous agents · human priors removal · cloud phone infrastructure · policy optimization

The pith

The most effective path to general adaptive agents is to remove human priors and let intelligence emerge from interaction with complex mobile GUI environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that following the principle that computation on complex data outperforms hand-crafted designs, the route to capable agents is to strip away human-specified tasks and verification so that behavior can arise from direct engagement with a world far richer than the agent. Mobile graphical user interfaces serve as a concrete, accessible stand-in for that richer world. The authors supply an open-source infrastructure that runs many agent sessions in parallel on cloud phones to solve the data bottleneck that otherwise blocks large-scale reinforcement learning. They map out a sequence for removing the remaining human elements from how tasks are generated, how success is judged, and how past experience is stored.

Core claim

Darwin Mobile Agent supplies both the cloud-phone infrastructure and the conceptual roadmap needed to move from supervised mobile control toward fully autonomous reinforcement learning, by treating the mobile GUI as a Big World proxy and by outlining the staged removal of human priors from task curricula, outcome verification, and memory management.

What carries the argument

The asynchronous agent-environment loop running across many parallel cloud-phone instances, which removes the data-collection bottleneck and supports stable policy optimization as the initial stage of self-evolution.
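
A minimal sketch of what such an asynchronous loop could look like, assuming simulated phones in place of real cloud-phone sessions. The names `run_phone`, `aggregate`, and `collect` are hypothetical and are not taken from the Darwin codebase; the defaults of 8 phones and 32 steps per rollout echo the numbers in the Figure 2 caption. The point it illustrates is that trajectories are batched in completion order, so a slow device does not stall the others.

```python
import asyncio
import random


async def run_phone(phone_id: int, n_steps: int, queue: asyncio.Queue) -> None:
    """Simulate one slow phone session and emit its finished trajectory."""
    trajectory = []
    for t in range(n_steps):
        # Stand-in for device latency (screenshot capture, action execution).
        await asyncio.sleep(random.uniform(0.001, 0.005))
        trajectory.append((phone_id, t))
    await queue.put(trajectory)


async def aggregate(queue: asyncio.Queue, n_trajectories: int, batch_size: int) -> list:
    """Group trajectories into batches in the order they complete."""
    batches, current = [], []
    for _ in range(n_trajectories):
        current.append(await queue.get())
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


async def collect(n_phones: int = 8, n_steps: int = 32, batch_size: int = 4) -> list:
    """Run all simulated phones concurrently and return batched trajectories."""
    queue: asyncio.Queue = asyncio.Queue()
    phones = [asyncio.create_task(run_phone(i, n_steps, queue)) for i in range(n_phones)]
    batches = await aggregate(queue, n_phones, batch_size)
    await asyncio.gather(*phones)
    return batches


batches = asyncio.run(collect())
total_env_steps = sum(len(traj) for batch in batches for traj in batch)
print(len(batches), total_env_steps)
```

With the defaults, one collection round gathers 8 × 32 = 256 environment steps, which is the size of one training step as described in the Figure 2 caption. A real system would replace the sleep with device I/O and feed each batch to verification and policy optimisation.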

If this is right

Policy optimization becomes feasible at scale inside real mobile interfaces without manual task engineering.
Agents can progress through successive stages that eliminate human input from curricula, verification, and memory.
The same parallel-cloud setup can support the later stages of the roadmap once the first stage succeeds.
The resulting agents would exhibit adaptive behavior across previously unseen mobile applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same removal-of-priors logic could be applied to other rich interfaces such as web browsers or robotic sensors.
Success would imply that large-scale unsupervised interaction data can substitute for curated training sets in many domains.
A direct test would track whether agents begin inventing their own sub-goals that were never supplied by the initial curriculum.

Load-bearing premise

The mobile GUI domain contains enough openness and complexity that intelligence will appear once human-designed tasks, checks, and memory rules are taken away.

What would settle it

Training agents for extended periods inside the Darwin framework and observing that they acquire no new capabilities on apps or tasks never shown during training would show the domain lacks sufficient richness.

Figures

Figures reproduced from arXiv: 2606.20622 by Daniel Beechey, Derek Yuen, Dezhao Luo, Jianheng Liu, Jun Wang, Kun Shao, Tiantian He, Weilin Luo.

**Figure 1.** Overview of the Darwin Mobile Agent Framework. The system employs an asynchronous Rollout Aggregator to bridge the gap between slow, parallelised mobile environments and high-throughput agent inference. Completed trajectories are passed to a verification module to generate rewards for continuous policy optimisation.

**Figure 2.** Mean Success Rate on SPA-Bench. The plot illustrates the average performance across eight tasks. Each training step represents 256 environment steps (8 phones × 32 steps per rollout), matching the maximum task horizon. The first 30 steps constitute a critic warm-up phase where the policy parameters are fixed. (a) Mean success rate vs. training steps. (b) Training steps vs. wall-clock time.

**Figure 3.** Distributed Scalability Evaluation. (a) Scaling from 8 to 32 phones shows consistent convergence despite increased off-policy data. (b) Throughput improves from 8 to 16 phones but saturates at 32, as the system bottleneck shifts to model inference and batching overhead.

**Figure 4.** Horizontal Scalability across Task Sets. The plot compares the mean success rate when training on 8 tasks (16 phones) versus 16 tasks (32 phones). Despite the increased task diversity, both configurations exhibit similar convergence rates.

**Figure 5.** Task Lifecycle Stability. The panels illustrate the mean success rate across the three protocol phases. The “cycle” protocol successfully manages automated transitions; the agent exhibits a learning trend during the setup and teardown phases that mirrors the progression of the primary task. (a) Initialisation and Warm-up. (b) Truncation Handling.

**Figure 6.** Critic Robustness Evaluation. (a) Calibrated initialisation prevents policy collapse; stability is further supported by combining this initialisation with a warm-up phase. (b) Treating truncated trajectories as terminal by assuming the next state-value is zero (V(s′) = 0) outperforms bootstrapping from a learned critic.
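
The truncation choice in the Figure 6 caption can be made concrete with a small advantage computation. The sketch below uses generalized advantage estimation as an assumed estimator; this page does not state which estimator the paper uses, and the function name `gae_advantages` and the flag `bootstrap_on_truncation` are hypothetical. It only shows how setting V(s′) = 0 at a truncation differs from bootstrapping from the critic.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    next_values: Sequence[float],
    terminated: Sequence[bool],
    truncated: Sequence[bool],
    gamma: float = 0.99,
    lam: float = 0.95,
    bootstrap_on_truncation: bool = False,
) -> List[float]:
    """Compute GAE advantages with a switchable rule for truncated steps."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        episode_end = terminated[t] or truncated[t]
        # Zero the next-state value at a true terminal state, and also at a
        # truncation unless bootstrapping from the critic is requested.
        if terminated[t] or (truncated[t] and not bootstrap_on_truncation):
            next_v = 0.0
        else:
            next_v = next_values[t]
        delta = rewards[t] + gamma * next_v - values[t]
        # Never propagate advantage across an episode boundary.
        running = delta + (0.0 if episode_end else gamma * lam * running)
        advantages[t] = running
    return advantages


# Toy rollout (assumed values): no reward, constant critic estimate of 0.5,
# cut off by the step limit at the last transition.
rollout = dict(
    rewards=[0.0, 0.0, 0.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.5],
    terminated=[False, False, False],
    truncated=[False, False, True],
    gamma=1.0,
    lam=1.0,
)
zero_tail = gae_advantages(**rollout)
bootstrapped = gae_advantages(**rollout, bootstrap_on_truncation=True)
print(zero_tail, bootstrapped)
```

In this toy rollout, treating the truncation as terminal gives every step a negative advantage, so running out of steps without success is penalised. Bootstrapping instead trusts the critic's estimate at the cut-off and yields zero advantage, which depends on the critic being accurate there.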

Original abstract

The goal of artificial intelligence is to create agents capable of general, adaptive behaviour in open-ended environments. Guided by the "Bitter Lesson", we argue that the most effective path toward this goal is to systematically remove human priors and allow intelligence to naturally emerge through interaction with a "Big World" that is orders of magnitude more complex than the agent itself. We propose the mobile Graphical User Interface (GUI) as a practical proxy for such a world and introduce Darwin Mobile Agent, an open-source infrastructure designed as a foundation for autonomous reinforcement learning in this domain. This framework addresses the data-collection bottleneck in real-world mobile interactions by using an asynchronous agent-environment loop across parallel cloud-phone instances. We further propose a conceptual roadmap to systematically remove human priors from three fundamental pillars of a self-evolving agent: task curricula, outcome verification, and memory management. We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap: policy optimisation in the GUI domain. This work establishes the practical and theoretical foundation necessary to move toward truly autonomous, self-evolving GUI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ships a concrete open-source infrastructure for parallel mobile GUI interactions plus a clear three-pillar roadmap, but the self-evolution claims rest on untested assumptions with only basic stability checks shown.

The new piece is the Darwin infrastructure itself: an asynchronous loop across cloud-phone instances that lets you run many real mobile GUI sessions in parallel without the usual data-collection choke point. That is a practical engineering step for anyone who wants to train agents on actual device interfaces rather than simulators. The three-pillar roadmap—removing human priors from task curricula, outcome verification, and memory management—is laid out plainly and gives a usable structure for future work. They also report that the basic setup is stable and scales for ordinary policy optimization, which is the necessary first checkpoint.

The soft spot is that nothing tests the central hypothesis. The abstract claims validation of stability and scalability, but supplies no metrics, baselines, or runs that touch prior removal or emergence. The assumption that mobile GUIs are open and complex enough to drive general intelligence once priors are stripped is stated but not probed. Without even a small demonstration on one pillar, the roadmap stays conceptual.

This is for people building or studying real-world interactive agents who need a starting codebase and a way to organize the next steps. The citation pattern is standard and points to relevant prior GUI-agent work. The thinking is coherent on its own terms even if the big claims are forward-looking.

I would send it to peer review. The infrastructure is usable and the roadmap is explicit, so referees can give concrete feedback on both the engineering and the missing experiments.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Darwin Mobile Agent, an open-source infrastructure using asynchronous agent-environment loops across parallel cloud-phone instances to address data-collection bottlenecks for reinforcement learning in the mobile GUI domain. It presents this domain as a practical proxy for a 'Big World' and outlines a conceptual roadmap for systematically removing human priors from task curricula, outcome verification, and memory management to enable emergent intelligence. The paper asserts that the infrastructure has been validated for stability and scalability sufficient to support the first stage of policy optimisation in this domain.

Significance. If the infrastructure proves stable and scalable and the mobile-GUI proxy supplies adequate complexity once priors are removed, the work could provide a practical foundation for studying autonomous, self-evolving agents and reduce reliance on human-designed curricula and verifiers. The explicit roadmap for prior removal across three pillars is a clear organising contribution. The significance remains prospective because the manuscript is a proposal and infrastructure description rather than an empirical demonstration of emergence or prior removal.

major comments (2)

[Abstract] The claim that 'We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap' is load-bearing for positioning the infrastructure as the foundation for the proposed self-evolution program, yet the manuscript supplies no experimental setup, metrics, baselines, or results to support this validation.

[Roadmap, conceptual pillars] The proposal to remove human priors from outcome verification and memory management is presented without a concrete mechanism or falsifiable test for how verification and memory will be replaced by emergent processes once the infrastructure is in place, leaving the central hypothesis ungrounded beyond the initial policy-optimisation stage.

minor comments (1)

The term 'Big World' is used repeatedly without a precise operational definition or reference to prior literature that would allow readers to assess the claimed orders-of-magnitude complexity difference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of the infrastructure and the organizing value of the roadmap. Our manuscript is positioned as an infrastructure proposal together with a high-level conceptual roadmap; we address each major comment below.

Referee: [Abstract] The claim that 'We validate that the Darwin infrastructure provides the stability and scalability required for the first stage of this roadmap' is load-bearing for positioning the infrastructure as the foundation for the proposed self-evolution program, yet the manuscript supplies no experimental setup, metrics, baselines, or results to support this validation.

Authors: We agree that the validation claim requires substantiation. The current text asserts stability on the basis of the asynchronous design and initial cloud-phone deployments but contains no formal experimental protocol, metrics, baselines, or quantitative results. We will revise the abstract to remove or qualify the claim and will add a short section describing any available preliminary stability observations or explicitly note the absence of formal validation.

Revision: yes

Referee: [Roadmap, conceptual pillars] The proposal to remove human priors from outcome verification and memory management is presented without a concrete mechanism or falsifiable test for how verification and memory will be replaced by emergent processes once the infrastructure is in place, leaving the central hypothesis ungrounded beyond the initial policy-optimisation stage.

Authors: The roadmap is deliberately conceptual and identifies the three pillars as directions for subsequent research once the infrastructure enables the first stage of policy optimization. No concrete mechanisms or falsifiable tests are supplied because these steps lie beyond the scope of the present proposal. We will revise the text to state explicitly that the verification and memory pillars are high-level objectives without current implementations. The grounding for the initial stage rests on the infrastructure's capacity to support RL without human-designed task curricula; the later pillars remain prospective.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a conceptual proposal and infrastructure roadmap rather than a derivation or empirical study containing fitted parameters, predictions, or mathematical claims. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the central hypothesis (removing human priors to enable emergence in a mobile-GUI proxy) is presented as a guiding principle supported by a basic stability validation, with no internal reduction to its own inputs. The work is self-contained as an engineering and conceptual foundation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that scaling interaction with a complex proxy environment will produce general intelligence once human priors are removed; no free parameters or new invented physical entities are introduced.

axioms (1)

Domain assumption: the Bitter Lesson (removing human priors and scaling compute yields superior AI performance). Explicitly invoked in the abstract as the guiding principle for the proposed approach.

invented entities (1)

Darwin Mobile Agent infrastructure (no independent evidence)
Purpose: asynchronous parallel cloud-phone platform for GUI reinforcement learning.
Newly proposed system whose independent evidence consists only of the authors' stability claim.

pith-pipeline@v0.9.1-grok · 5734 in / 1259 out tokens · 48658 ms · 2026-06-29T16:42:06.330826+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 15 canonical work pages · 10 internal anchors

[1] David Abel, Mark K Ho, and Anna Harutyunyan. Three dogmas of reinforcement learning. arXiv preprint arXiv:2407.10583.

[2] Qwen3-VL Technical Report. URL https://arxiv.org/abs/2511.21631. Michael Bowling, John D Martin, David Abel, and Will Dabney. Settling the reward hypothesis. In International Conference on Machine Learning, pages 3003–3020. PMLR.

[3] Esraa Elelimy, David Szepesvari, Martha White, and Michael Bowling. Rethinking the foundations for continual reinforcement learning. arXiv preprint arXiv:2504.08161.

[4] Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-Group policy optimization for LLM agent training. arXiv preprint arXiv:2505.10978.

[5] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[6] Martin Klissarov, Akhil Bagaria, Ziyan Luo, George Konidaris, Doina Precup, and Marlos C Machado. Discovering temporal structure: An overview of hierarchical reinforcement learning. arXiv preprint arXiv:2506.14045.

[7] Ning Li, Xiangmou Qu, Jiamu Zhou, Jun Wang, Muning Wen, Kounianhua Du, Xingyu Lou, Qiuying Peng, and Weinan Zhang. MobileUse: A GUI agent with hierarchical reflection for autonomous mobile operation. arXiv preprint arXiv:2507.16853.

[8] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326.

[9] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. AndroidWorld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573.

[10] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

[11] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[12] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[13] URL https://openreview.net/forum?id=qPMLvJxtPK. Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. AndroidEnv: A reinforcement learning platform for Android. arXiv preprint arXiv:2105.13231.

[14] URL http://arxiv.org/abs/2105.13231. Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144.

[15] Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118.

[16] The tasks for each experiment are selected from SPA-Bench (Chen et al., 2025), which covers diverse everyday workflows (e.g. e-commerce, social, utilities) derived from production applications with visually dynamic interfaces and naturally occurring UI variations. It includes both English and Chinese instructions, enabling evaluation of multilingual agent...
