arxiv: 2604.15093 · v1 · submitted 2026-04-16 · 💻 cs.AI · cs.CL· cs.CV· cs.HC

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

Kanzhi Cheng , Zehao Li , Zheng Ma , Nuo Chen , Jialin Cao , Qiushi Sun , Zichen Ding , Fangzhi Xu

show 6 more authors

Hang Yan Jiajun Chen Anh Tuan Luu Jianbing Zhang Lewei Lu Dahua Lin

This is my paper

Pith reviewed 2026-05-10 11:33 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.CVcs.HC

keywords mobile agentstask synthesistrajectory synthesisvision-language modelsopen-source dataAndroidWorldpolicy switchingerror recovery

0 comments

The pith

OpenMobile synthesizes open task instructions and trajectories that train vision-language models to reach over 50 percent success on mobile agent benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mobile agents powered by vision-language models can automate tasks effectively but depend on closed proprietary training data that limits wider progress. The paper introduces an open framework with a task synthesis pipeline that first builds a global environment memory from exploration and then generates diverse grounded instructions from it. A second component uses policy switching during trajectory rollout to alternate learner and expert models and collect error-recovery examples that standard imitation learning overlooks. Models fine-tuned on the resulting data achieve 51.7 percent and 64.7 percent on AndroidWorld while surpassing previous open-data methods. Transparent overlap checks confirm that gains come from broad functional coverage rather than leakage into test sets.

Core claim

OpenMobile is an open-source framework whose scalable task synthesis pipeline constructs a global environment memory from exploration to produce diverse grounded instructions and whose policy-switching strategy for trajectory rollout alternates learner and expert models to capture essential error-recovery data. Fine-tuned Qwen2.5-VL and Qwen3-VL models trained on this data reach 51.7 percent and 64.7 percent success on AndroidWorld, exceeding existing open-data approaches, with analyses verifying that performance stems from broad functionality coverage rather than benchmark contamination.

What carries the argument

The policy-switching strategy for trajectory rollout, which alternates between learner and expert models to collect error-recovery data missing from standard imitation learning.

If this is right

Fine-tuned open vision-language models can reach competitive success rates on dynamic mobile benchmarks such as AndroidWorld.
Inclusion of error-recovery trajectories from policy switching improves robustness in handling failures during task execution.
Transparent overlap analysis between synthetic data and test sets confirms that gains reflect broad coverage rather than contamination.
Public release of the generated data and code removes the data barrier for community research on mobile agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-memory-plus-policy-switching approach could be tested for generating training data in related agent settings such as web navigation or desktop automation.
Expanding the scale of the initial exploration phase might produce even richer task diversity for handling longer-horizon mobile workflows.
The emphasis on error recovery highlights a general limitation in pure imitation learning for agents and suggests policy switching as a reusable technique.

Load-bearing premise

The synthetic tasks and trajectories are sufficiently diverse, grounded in real mobile interfaces, and free of benchmark contamination to produce genuine generalization rather than overfitting.

What would settle it

Measuring that models trained on the OpenMobile data show no meaningful improvement over baselines when evaluated on a fresh collection of mobile apps and tasks drawn entirely outside the original exploration memory would indicate the synthesis fails to generalize.

Figures

Figures reproduced from arXiv: 2604.15093 by Anh Tuan Luu, Dahua Lin, Fangzhi Xu, Hang Yan, Jiajun Chen, Jialin Cao, Jianbing Zhang, Kanzhi Cheng, Lewei Lu, Nuo Chen, Qiushi Sun, Zehao Li, Zheng Ma, Zichen Ding.

**Figure 1.** Figure 1: Performance Comparison. Task success rates across three dynamic mobile agent benchmarks. Our models significantly surpass open-data baselines and are competitive with leading closed-data systems. Data Scaling. AndroidWorld performance with increasing synthesized instructions. Error Correction Capacity. OpenMobile data substantially enhances the agent’s error-recovery ability in live environments. 1 Introd… view at source ↗

**Figure 2.** Figure 2: The overview of OpenMobile. (a) Scalable Task Synthesis. Instead of relying on a [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Left: Semantic similarity between synthetic and AndroidWorld instructions. Our synthesized instructions exhibit moderate functionality-level relevance, with only 3.5% exceeding a similarity of 0.7. Right: Impact of removing test-similar instructions from training. Removing a small fraction of the most similar instructions causes only a marginal performance drop, mitigating benchmark overfitting concerns. 8… view at source ↗

**Figure 4.** Figure 4: Left: Functionality coverage of AndroidWorld tasks as synthesized instructions scale. OpenMobile consistently achieves higher coverage than the coupled baseline. Right: Tasks with lower complexity (fewer required functionalities) and higher functionality coverage by synthetic data achieve higher success rates. 5.2 What Drives the Effectiveness of OpenMobile Data? OpenMobile data is grounded in the benchma… view at source ↗

read the original abstract

Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) The first is a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions. and (2) a policy-switching strategy for trajectory rollout. By alternating between learner and expert models, it captures essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OpenMobile delivers open synthetic tasks and trajectories that lift fine-tuned VLMs to 51.7-64.7% on AndroidWorld, beating prior open baselines, with an overlap check to address contamination.

read the letter

The main point is that this paper supplies a practical open recipe for generating mobile-agent training data and shows that models trained on it outperform earlier open-data efforts on three benchmarks, especially AndroidWorld. The two pieces that appear new are the global environment memory built from exploration to create grounded instructions, and the learner-expert policy switch during rollout to collect recovery trajectories that plain imitation learning usually skips. Both address real gaps in current mobile-agent work. They also release the data and code, which is the most immediately useful part for anyone who wants to train or extend these agents without starting from scratch. The overlap analysis they describe is a step in the right direction for ruling out simple leakage. That said, the abstract gives headline numbers but no dataset sizes, ablation tables, or error bars, so it is difficult to judge how much the new components actually move the needle versus simply having more data. If the overlap check is limited to string matches rather than functional or semantic similarity, some contamination could still slip through, though the paper presents the analysis as transparent. The work is aimed at researchers who need open training resources for phone-automation agents or adjacent domains like web agents. It is solid enough on the engineering side and has enough artifacts that a serious referee should see it, even if the central claims will need tighter numbers and validation in review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces OpenMobile, an open-source framework for synthesizing task instructions and agent trajectories to train vision-language model-based mobile agents. It features a scalable task synthesis pipeline that builds a global environment memory from exploration to generate diverse, grounded instructions, combined with a policy-switching strategy during trajectory rollout to capture error-recovery behaviors absent in standard imitation learning. Fine-tuned Qwen2.5-VL and Qwen3-VL models achieve 51.7% and 64.7% success on AndroidWorld (surpassing prior open-data baselines), with similar competitive results on two other dynamic benchmarks; the authors include overlap analyses between synthetic data and benchmark test sets to argue that gains reflect broad coverage rather than overfitting. Data and code are released.

Significance. If the central performance claims hold under rigorous controls, this would be a meaningful contribution to mobile agent research by closing the data gap left by closed-source systems and providing a reproducible pipeline for task/trajectory synthesis. The explicit release of data and code is a clear strength that supports broader adoption and follow-on work. The policy-switching approach for error recovery is a targeted methodological idea worth further exploration.

major comments (2)

[§5] §5 (Overlap Analysis): The manuscript states that 'transparent analyses on the overlap' were conducted to rule out benchmark overfitting, but provides no quantitative details such as the exact matching method (string vs. embedding similarity), overlap percentage, or threshold for declaring contamination. Without these, it is impossible to evaluate whether the analysis addresses semantic or functional equivalence (e.g., differently worded but identical UI flows), which directly bears on the validity of attributing the 51.7%/64.7% AndroidWorld gains to generalization.
[§4] §4 (Experiments): The headline success rates (51.7% for Qwen2.5-VL and 64.7% for Qwen3-VL on AndroidWorld) are reported without error bars, standard deviations, number of evaluation runs, or ablation studies on key components such as the global memory or policy-switching. This absence makes it difficult to determine whether the improvements over open-data baselines are statistically reliable or sensitive to implementation details.

minor comments (2)

[Abstract] The abstract and §4 would benefit from stating the total scale of the released dataset (number of tasks and trajectories) to allow readers to contextualize the synthesis effort.
[§3] Figure captions in the trajectory examples section could more explicitly describe the visual elements shown (e.g., which UI states correspond to learner vs. expert actions).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback. We address the major comments below and will incorporate revisions to improve the clarity and rigor of the manuscript.

read point-by-point responses

Referee: §5 (Overlap Analysis): The manuscript states that 'transparent analyses on the overlap' were conducted to rule out benchmark overfitting, but provides no quantitative details such as the exact matching method (string vs. embedding similarity), overlap percentage, or threshold for declaring contamination. Without these, it is impossible to evaluate whether the analysis addresses semantic or functional equivalence (e.g., differently worded but identical UI flows), which directly bears on the validity of attributing the 51.7%/64.7% AndroidWorld gains to generalization.

Authors: We appreciate this observation. While the manuscript mentions transparent analyses, we agree that more quantitative details are needed to fully address potential concerns about overfitting. In the revised version, we will expand the overlap analysis section to specify the matching method (e.g., a combination of string matching and embedding similarity), report the exact overlap percentages, and include a threshold for contamination. We will also provide examples to show how semantic and functional equivalence are considered, strengthening the argument that gains reflect broad coverage. revision: yes
Referee: §4 (Experiments): The headline success rates (51.7% for Qwen2.5-VL and 64.7% for Qwen3-VL on AndroidWorld) are reported without error bars, standard deviations, number of evaluation runs, or ablation studies on key components such as the global memory or policy-switching. This absence makes it difficult to determine whether the improvements over open-data baselines are statistically reliable or sensitive to implementation details.

Authors: We concur that statistical details and ablations are important for validating the results. We will update the experiments section to include error bars and standard deviations from repeated evaluation runs, specify the number of runs performed, and add ablation studies on the global environment memory and policy-switching components. These revisions will help demonstrate the robustness of our findings. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmark grounding

full rationale

The paper describes a task/trajectory synthesis pipeline (global memory + policy switching) whose outputs are used to fine-tune VLMs, with success rates then measured directly on AndroidWorld and other benchmarks. The overlap analysis is presented as an explicit check to support the non-overfitting interpretation, but this is an author-provided validation step rather than a self-referential reduction of the performance numbers to the synthesis inputs. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the headline results remain independent experimental outcomes against fixed external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions that vision-language models can follow generated instructions and that simulated mobile environments sufficiently approximate real devices; no new physical constants or invented particles are introduced.

axioms (2)

domain assumption Vision-language models can reliably interpret and act on synthetically generated natural-language instructions in mobile UIs.
Invoked when the pipeline uses the memory to produce instructions that the agent must execute.
domain assumption Alternating between learner and expert policies during rollout produces useful error-recovery trajectories without introducing harmful distribution shift.
Central to the second component; no formal justification supplied in abstract.

pith-pipeline@v0.9.0 · 5590 in / 1450 out tokens · 24423 ms · 2026-05-10T11:33:38.461903+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

21 extracted references

[1]

A screenshot of a UI screen (Screen Before) with the action area marked in red
[2]

The action type performed
[3]

The resulting screenshot after the action (Screen After)
[4]

type": "functionality

The name of the Android app 15 Preprint. Under review. Your task is to analyze the elements on thesecond screenshot (Screen After)ONLY. The first screenshot is provided only as context to help you understand the app’s state. Each element should be output as a dictionary: { "type": "functionality" or "data", "label": "A short phrase describing its identifi...
[5]

A recalled screenshot of a specific screen
[6]

Several screenshots in short-term memory that have transition relationships with this screenshot (screens that can be reached from the current screen)
[7]

Guidelines 16 Preprint

Importantly, some functionalities retrieved from long-term memory that are associated with the current screen (semantically related functionalities from other screens in the same app) Based on these three sources of information, you should fully associate, imagine, and generate long-range, high-level tasks/instructions that are possible within the current...
[8]

Your ONLY task is to synthesize clear multi-step GUI instructions

The provided screenshots and functionalities are only a portion of your recalled memories serving as context. Your ONLY task is to synthesize clear multi-step GUI instructions. The instructions you synthesize do not need to have direct connections with the current screen or operations, but can be inferred from the context. However, to ensure the difficult...
[9]

Set an alarm for tomorrow at 8 AM that repeats every weekday

There are two types of tasks to generate: • Action tasks: Require performing a series of actions to accomplish a goal. For example: “Set an alarm for tomorrow at 8 AM that repeats every weekday.” • Question-answering tasks: Require performing a series of actions and answering a question related to the environment’s content. For example: “In my to-do list,...
[10]

Help me create a new event in the calendar

Synthesized tasksmust be clear and explicit. Generated tasks should be specific with sufficient details, so that executors will not feel confused. For example, “Help me create a new event in the calendar” is too broad. It should include concrete configurations, e.g., date, time, title, description, duration, location, etc
[11]

Synthesized tasks must be executable.If you want to generate a task that involves operating on app data (for example, deleting an entry in the calendar), you MUST make sure the data you want to operate on is present in the given screenshots
[12]

Do not only focus on the app’s main functions

Generated tasks should be diverse. Do not only focus on the app’s main functions. Try to cover all functionalities of the app as much as possible, for example, elements or functions in corners of screens, or functionalities you associate from memories
[13]

Do not generate single-step tasks such as clicking a button

Generated tasks should be long-range. Do not generate single-step tasks such as clicking a button. You are encouraged to generate tasks that require executors to reason, plan, and complete in multiple steps.You can also consider combining different sub-functions or sub-tasks into a long-range task, but ensure reasonableness
[14]

Generated tasks should be high-level.Do not generate step-by-step instructions and detailed actions.Instead, integrate multi-step instructions into a high-level intent to increase task difficulty.They should be a single command that contains specific details, rather than step-by-step operations for completing a task
[15]

Do not generate tasks that are bound to temporary states of the current interface (for example, a popup dialog that appears)

Generated tasks should start from the phone’s home screen, not from the currently pro- vided screen. Do not generate tasks that are bound to temporary states of the current interface (for example, a popup dialog that appears)
[16]

Access and manage the list of all saved Bluetooth devices

The operating environment is a virtual device with no network connection. Do not generate tasks that require internet connection or login. However, you can freely use data that is already saved in the existing app. Example Tasks Here are examples showing bad tasks and their improved versions: Example 1: – Bad: “Access and manage the list of all saved Blue...
[17]

type”: functionality, “description

“type”: functionality, “description”:{description 1}
[18]

These are screens that can be reached from the current screen: ### Associated Screen 1 Elements({N}items):

These are screens that can transition into the current screen: ### Preceding Screen 1 Elements({N}items): ... These are screens that can be reached from the current screen: ### Associated Screen 1 Elements({N}items): ... ## Related Functionalities from Other Screens ({M}items) These are semantically related functionalities from other screens in the same a...
[19]

reasoning

## Your Task Based on the above context, carefully analyze and think, then generate 1–3 high-quality GUI tasks. Each task should be a concise but high-level instruction in English. Output format (JSON array): [ {"reasoning": "...", "task": "task instruction 1"}, {"reasoning": "...", "task": "task instruction 2"} ] B Experiment Settings B.1 Benchmark Evalu...
[20]

and the MobileWorld leaderboard (Kong et al., 2025). B.2 Policy-Switching Rollout Settings We use Qwen2.5-VL-7B-Instruct as the base model (learner πl) for all policy-switching ablations, with Gemini-3.1-Pro-Preview as the expert πe.Expert distillation.The expert model executes all synthesized instructions. We retain trajectories where the expert signals ...

2025
[21]

19 Preprint

framework, which provides an infrastructure of over a hundred Android emulator Method Base Model AndroidWorld AndroidLab MobileWorld Pass@1↑Pass@3↑ Pass@1↑Pass@3↑ Pass@1↑Pass@3↑ Qwen3-VL-8B – 47.6 ± 2.2 62.1 43.5 – 9.4 – Ours-8B Qwen3-VL 64.7 ± 3.2 78.0 51.5 ± 0.7 62.3 17.7 ± 2.2 24.8 Ours-8B-RL Qwen3-VL 64.1 ± 0.5 77.6 53.9 ± 1.5 63.0 16.8 ± 0.9 20.5 Tab...

2026