OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
Pith reviewed 2026-05-10 11:33 UTC · model grok-4.3
The pith
OpenMobile synthesizes open task instructions and trajectories that train vision-language models to reach over 50 percent success on the AndroidWorld mobile agent benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenMobile is an open-source framework with two components. A scalable task synthesis pipeline constructs a global environment memory from exploration and uses it to produce diverse, grounded instructions. A policy-switching strategy for trajectory rollout alternates learner and expert models to capture essential error-recovery data. Fine-tuned Qwen2.5-VL and Qwen3-VL models trained on this data reach 51.7 percent and 64.7 percent success on AndroidWorld, respectively, exceeding existing open-data approaches, with the authors' analyses indicating that performance stems from broad functionality coverage rather than benchmark contamination.
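To make the idea of a global environment memory concrete, here is a minimal sketch under stated assumptions. The class `EnvironmentMemory`, its method names, the screen identifiers, and the word-overlap ranking are all hypothetical illustrations; in particular, word overlap is only a crude stand-in for whatever semantic retrieval the framework actually uses.

```python
from collections import defaultdict
from typing import Dict, List, Set


class EnvironmentMemory:
    """Toy global memory: screens, their functionalities, and transitions."""

    def __init__(self) -> None:
        self.functionalities: Dict[str, List[str]] = defaultdict(list)
        self.successors: Dict[str, Set[str]] = defaultdict(set)

    def record(self, before: str, after: str, found: List[str]) -> None:
        """Store one exploration step: a transition and what was seen after it."""
        self.successors[before].add(after)
        for item in found:
            if item not in self.functionalities[after]:
                self.functionalities[after].append(item)

    def reachable_from(self, screen: str) -> List[str]:
        """Screens observed to follow `screen` during exploration."""
        return sorted(self.successors[screen])

    def related_functionalities(self, screen: str, top_k: int = 2) -> List[str]:
        """Rank functionalities on other screens by word overlap with this
        screen's functionalities (assumption: stand-in for semantic retrieval)."""
        own_words = {w for f in self.functionalities[screen] for w in f.lower().split()}
        scored = []
        for other, items in self.functionalities.items():
            if other == screen:
                continue
            for item in items:
                overlap = len(own_words & set(item.lower().split()))
                if overlap > 0:
                    scored.append((overlap, item))
        # Highest overlap first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [item for _, item in scored[:top_k]]


# Hypothetical exploration log for a clock app.
memory = EnvironmentMemory()
memory.record("home", "alarm_list", ["add alarm", "toggle alarm"])
memory.record("alarm_list", "alarm_edit", ["set alarm repeat days", "set alarm sound"])
memory.record("home", "timer", ["start timer"])

# Context that a task generator could be prompted with for one screen.
context = {
    "screen": "alarm_list",
    "reachable": memory.reachable_from("alarm_list"),
    "related": memory.related_functionalities("alarm_list"),
}
```

The point of the design is that an instruction generator sees more than one screenshot: it can combine what is on the current screen with neighbouring screens and related functions elsewhere in the app, which is one plausible route to the diverse, grounded instructions described above.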
What carries the argument
The policy-switching strategy for trajectory rollout, which alternates between learner and expert models to collect error-recovery data missing from standard imitation learning.
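A toy sketch may help show why alternating policies yields recovery data. Everything here is an assumption made for illustration: the one-dimensional "environment", the two hand-written policies, and especially the switching rule (hand control to the expert for a fixed number of steps after the learner moves away from the goal). The source does not specify how OpenMobile decides when to switch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Toy setting (assumption): the "screen" is an integer position and the task
# is to reach GOAL. Real rollouts would use emulator screenshots and GUI actions.
GOAL = 5

Policy = Callable[[int], int]  # maps a state to an action (+1 or -1)


def expert_policy(state: int) -> int:
    """Hypothetical expert: always moves toward the goal."""
    return 1 if state < GOAL else -1


def learner_policy(state: int) -> int:
    """Hypothetical learner: correct except for a systematic mistake at state 2."""
    return -1 if state == 2 else 1


@dataclass
class Step:
    state: int
    action: int
    actor: str  # "learner" or "expert"


def rollout_with_switching(
    learner: Policy,
    expert: Policy,
    start: int = 0,
    max_steps: int = 20,
    expert_budget: int = 2,
) -> Tuple[List[Step], bool]:
    """Learner acts until it moves away from the goal; the expert then acts
    for `expert_budget` steps to recover before control returns to the learner.

    The switching rule is an illustrative assumption, not the paper's.
    """
    trajectory: List[Step] = []
    state, expert_steps_left = start, 0
    for _ in range(max_steps):
        if state == GOAL:
            return trajectory, True
        actor = "expert" if expert_steps_left > 0 else "learner"
        action = expert(state) if actor == "expert" else learner(state)
        trajectory.append(Step(state, action, actor))
        next_state = state + action
        if actor == "expert":
            expert_steps_left -= 1
        elif abs(GOAL - next_state) > abs(GOAL - state):
            # Learner erred: the expert takes over from the off-track state.
            expert_steps_left = expert_budget
        state = next_state
    return trajectory, state == GOAL


trajectory, success = rollout_with_switching(learner_policy, expert_policy)
recovery_steps = [s for s in trajectory if s.actor == "expert"]
```

In this sketch the expert's steps start from a state that the learner's own mistake produced. Pure expert demonstrations would never visit that state, which is the gap in standard imitation learning that the policy-switching idea targets.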
If this is right
- Fine-tuned open vision-language models can reach competitive success rates on dynamic mobile benchmarks such as AndroidWorld.
- Inclusion of error-recovery trajectories from policy switching improves robustness in handling failures during task execution.
- Transparent overlap analysis between synthetic data and test sets confirms that gains reflect broad coverage rather than contamination.
- Public release of the generated data and code removes the data barrier for community research on mobile agents.
Where Pith is reading between the lines
- The same global-memory-plus-policy-switching approach could be tested for generating training data in related agent settings such as web navigation or desktop automation.
- Expanding the scale of the initial exploration phase might produce even richer task diversity for handling longer-horizon mobile workflows.
- The emphasis on error recovery highlights a general limitation in pure imitation learning for agents and suggests policy switching as a reusable technique.
Load-bearing premise
The synthetic tasks and trajectories are sufficiently diverse, grounded in real mobile interfaces, and free of benchmark contamination to produce genuine generalization rather than overfitting.
What would settle it
If models trained on the OpenMobile data showed no meaningful improvement over baselines on a fresh collection of mobile apps and tasks drawn entirely from outside the original exploration memory, that would indicate the synthesis fails to generalize.
Mobile agents powered by vision-language models have demonstrated impressive capabilities in automating mobile tasks, with recent leading models achieving a marked performance leap, e.g., nearly 70% success on AndroidWorld. However, these systems keep their training data closed and remain opaque about their task and trajectory synthesis recipes. We present OpenMobile, an open-source framework that synthesizes high-quality task instructions and agent trajectories, with two key components: (1) a scalable task synthesis pipeline that constructs a global environment memory from exploration, then leverages it to generate diverse and grounded instructions; and (2) a policy-switching strategy for trajectory rollout that alternates between learner and expert models to capture essential error-recovery data often missing in standard imitation learning. Agents trained on our data achieve competitive results across three dynamic mobile agent benchmarks: notably, our fine-tuned Qwen2.5-VL and Qwen3-VL reach 51.7% and 64.7% on AndroidWorld, far surpassing existing open-data approaches. Furthermore, we conduct transparent analyses on the overlap between our synthetic instructions and benchmark test sets, and verify that performance gains stem from broad functionality coverage rather than benchmark overfitting. We release data and code at https://njucckevin.github.io/openmobile/ to bridge the data gap and facilitate broader mobile agent research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OpenMobile, an open-source framework for synthesizing task instructions and agent trajectories to train vision-language model-based mobile agents. It features a scalable task synthesis pipeline that builds a global environment memory from exploration to generate diverse, grounded instructions, combined with a policy-switching strategy during trajectory rollout to capture error-recovery behaviors absent in standard imitation learning. Fine-tuned Qwen2.5-VL and Qwen3-VL models achieve 51.7% and 64.7% success on AndroidWorld (surpassing prior open-data baselines), with similar competitive results on two other dynamic benchmarks; the authors include overlap analyses between synthetic data and benchmark test sets to argue that gains reflect broad coverage rather than overfitting. Data and code are released.
Significance. If the central performance claims hold under rigorous controls, this would be a meaningful contribution to mobile agent research by closing the data gap left by closed-source systems and providing a reproducible pipeline for task/trajectory synthesis. The explicit release of data and code is a clear strength that supports broader adoption and follow-on work. The policy-switching approach for error recovery is a targeted methodological idea worth further exploration.
major comments (2)
- [§5] §5 (Overlap Analysis): The manuscript states that 'transparent analyses on the overlap' were conducted to rule out benchmark overfitting, but provides no quantitative details such as the exact matching method (string vs. embedding similarity), overlap percentage, or threshold for declaring contamination. Without these, it is impossible to evaluate whether the analysis addresses semantic or functional equivalence (e.g., differently worded but identical UI flows), which directly bears on the validity of attributing the 51.7%/64.7% AndroidWorld gains to generalization.
- [§4] §4 (Experiments): The headline success rates (51.7% for Qwen2.5-VL and 64.7% for Qwen3-VL on AndroidWorld) are reported without error bars, standard deviations, number of evaluation runs, or ablation studies on key components such as the global memory or policy-switching. This absence makes it difficult to determine whether the improvements over open-data baselines are statistically reliable or sensitive to implementation details.
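As an illustration of the statistics the second comment asks for, the following sketch computes a mean success rate with a standard deviation over repeated runs, plus a pass-at-k figure. The outcome data are invented toy values, the function names are hypothetical, and two choices are assumptions: the sample standard deviation is used, and pass-at-k is taken to mean "solved in at least one of k runs". The manuscript may define either differently.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

# Hypothetical outcomes: runs[r][task] is True if the task succeeded in run r.
runs: List[Dict[str, bool]] = [
    {"t1": True, "t2": False, "t3": True, "t4": False},
    {"t1": True, "t2": True, "t3": True, "t4": False},
    {"t1": True, "t2": False, "t3": True, "t4": False},
]


def success_rate(run: Dict[str, bool]) -> float:
    """Percent of tasks solved in a single run."""
    return 100.0 * sum(run.values()) / len(run)


def mean_and_std(all_runs: List[Dict[str, bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run success rates."""
    rates = [success_rate(r) for r in all_runs]
    return mean(rates), stdev(rates)


def pass_at_k(all_runs: List[Dict[str, bool]]) -> float:
    """Percent of tasks solved in at least one of the k runs (assumed definition)."""
    tasks = list(all_runs[0].keys())
    solved = [any(run[t] for run in all_runs) for t in tasks]
    return 100.0 * sum(solved) / len(solved)


avg, spread = mean_and_std(runs)
print(f"Pass@1: {avg:.1f} ± {spread:.1f}  Pass@{len(runs)}: {pass_at_k(runs):.1f}")
```

Reporting the number of runs alongside the spread lets a reader judge whether a gap between two systems exceeds run-to-run variation.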
minor comments (2)
- [Abstract] The abstract and §4 would benefit from stating the total scale of the released dataset (number of tasks and trajectories) to allow readers to contextualize the synthesis effort.
- [§3] Figure captions in the trajectory examples section could more explicitly describe the visual elements shown (e.g., which UI states correspond to learner vs. expert actions).
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive feedback. We address the major comments below and will incorporate revisions to improve the clarity and rigor of the manuscript.
- Referee: §5 (Overlap Analysis): The manuscript states that 'transparent analyses on the overlap' were conducted to rule out benchmark overfitting, but provides no quantitative details such as the exact matching method (string vs. embedding similarity), overlap percentage, or threshold for declaring contamination. Without these, it is impossible to evaluate whether the analysis addresses semantic or functional equivalence (e.g., differently worded but identical UI flows), which directly bears on the validity of attributing the 51.7%/64.7% AndroidWorld gains to generalization.
  Authors: We appreciate this observation. While the manuscript mentions transparent analyses, we agree that more quantitative details are needed to fully address potential concerns about overfitting. In the revised version, we will expand the overlap analysis section to specify the matching method (e.g., a combination of string matching and embedding similarity), report the exact overlap percentages, and include a threshold for contamination. We will also provide examples to show how semantic and functional equivalence are considered, strengthening the argument that gains reflect broad coverage.
  revision: yes
- Referee: §4 (Experiments): The headline success rates (51.7% for Qwen2.5-VL and 64.7% for Qwen3-VL on AndroidWorld) are reported without error bars, standard deviations, number of evaluation runs, or ablation studies on key components such as the global memory or policy-switching. This absence makes it difficult to determine whether the improvements over open-data baselines are statistically reliable or sensitive to implementation details.
  Authors: We concur that statistical details and ablations are important for validating the results. We will update the experiments section to include error bars and standard deviations from repeated evaluation runs, specify the number of runs performed, and add ablation studies on the global environment memory and policy-switching components. These revisions will help demonstrate the robustness of our findings.
  revision: yes
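The kind of overlap check discussed in the first exchange could be sketched as follows. This is not the authors' procedure: `overlap_report`, the 0.8 threshold, and the example instructions are hypothetical, and token-level Jaccard similarity is used here only as a lightweight stand-in for embedding similarity.

```python
import re
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set similarity (assumption: stand-in for embedding similarity)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def char_ratio(a: str, b: str) -> float:
    """Character-level string similarity from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(a: str, b: str) -> float:
    return max(jaccard(a, b), char_ratio(a, b))


def overlap_report(
    synthetic: List[str], benchmark: List[str], threshold: float = 0.8
) -> Tuple[float, List[Tuple[str, str, float]]]:
    """Return the percent of synthetic instructions whose closest benchmark
    task scores at or above `threshold`, plus the flagged pairs."""
    flagged = []
    for s in synthetic:
        best = max(benchmark, key=lambda b: similarity(s, b))
        score = similarity(s, best)
        if score >= threshold:
            flagged.append((s, best, score))
    return 100.0 * len(flagged) / len(synthetic), flagged


# Invented examples, not taken from the released data or any benchmark.
synthetic = [
    "Set an alarm for 8 AM on weekdays",
    "Create a note titled Groceries with three items",
]
benchmark = [
    "Set an alarm for 8 AM on weekdays.",
    "Delete the contact named Alice",
]

rate, flagged = overlap_report(synthetic, benchmark)
```

Lexical measures like these catch near-verbatim duplicates but not differently worded tasks that exercise the same UI flow, which is exactly the gap the referee raises. Closing it would require a semantic or functional comparison on top of a check of this kind.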
Circularity Check
No circularity: empirical pipeline with external benchmark grounding
The paper describes a task/trajectory synthesis pipeline (global memory + policy switching) whose outputs are used to fine-tune VLMs, with success rates then measured directly on AndroidWorld and other benchmarks. The overlap analysis is presented as an explicit check to support the non-overfitting interpretation, but this is an author-provided validation step rather than a self-referential reduction of the performance numbers to the synthesis inputs. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the headline results remain independent experimental outcomes against fixed external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Vision-language models can reliably interpret and act on synthetically generated natural-language instructions in mobile UIs.
- domain assumption: Alternating between learner and expert policies during rollout produces useful error-recovery trajectories without introducing harmful distribution shift.