MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion
Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3
The pith
MobiFlow benchmarks mobile agents on real third-party apps by fusing trajectories into compressed graphs for evaluation without system APIs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MobiFlow presents a benchmark for mobile agents that complete tasks via GUI interactions in arbitrary third-party applications. Unlike prior benchmarks tied to Android emulators and system resources, it builds evaluation graphs by fusing multiple trajectories, which reduces the state space while preserving support for dynamic actions and success determination. The framework comprises 240 tasks spanning 20 common apps and adds enriched metrics. The paper claims that the resulting evaluations align more closely with human assessments than AndroidWorld's and can guide the development of future models suited to real operating conditions.
What carries the argument
The graph-construction algorithm based on multi-trajectory fusion, which compresses the state space of GUI interactions to enable accurate dynamic evaluation in third-party apps.
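The fusion idea can be illustrated with a minimal sketch. It assumes that each screen has already been reduced to a hashable state key and that a trajectory is a list of (state, action) steps plus a final state. Both the format and the `fuse` helper are hypothetical; the paper's actual equivalence rule and data layout are not given on this page.

```python
from collections import defaultdict

def fuse(trajectories):
    """Merge trajectories into one graph.

    Nodes are distinct state keys; edges map state -> {action: next_state}.
    Assumption: if the same (state, action) pair leads to different next
    states in two trajectories, the later trajectory overwrites the earlier.
    """
    nodes = set()
    edges = defaultdict(dict)
    for steps, final_state in trajectories:
        states = [state for state, _ in steps] + [final_state]
        nodes.update(states)
        for (state, action), next_state in zip(steps, states[1:]):
            edges[state][action] = next_state
    return nodes, dict(edges)

# Two toy trajectories that share the "home", "search" and "results" screens.
t1 = ([("home", "click:search"), ("search", "input:coffee"),
       ("results", "click:item0")], "detail")
t2 = ([("home", "click:cart"), ("cart", "back"),
       ("home", "click:search"), ("search", "input:coffee")], "results")

nodes, edges = fuse([t1, t2])
visited = sum(len(steps) + 1 for steps, _ in (t1, t2))
print(f"{visited} visited screens fused into {len(nodes)} graph nodes")
```

In this toy case, shared screens collapse into single nodes, which is the sense in which fusion compresses the state space. Any real implementation would also need a rule for deciding when two screenshots count as the same state.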
If this is right
- MobiFlow provides evaluation results that align more closely with human assessments than AndroidWorld.
- It supports evaluation in real-world scenarios where third-party apps lack system-level APIs.
- The framework can guide the training of future GUI-based models under real workloads.
- It covers a broader range of applications with 240 tasks across 20 apps.
- Enriched evaluation metrics improve the assessment of agent performance.
Where Pith is reading between the lines
- Similar fusion techniques could apply to benchmarks in other domains like web agents to manage state complexity.
- The trajectory data might uncover common successful interaction patterns for improving agent architectures.
- Scaling the approach to additional applications would further validate its robustness in varied environments.
- Direct integration into model training pipelines could produce agents optimized for production mobile use cases.
Load-bearing premise
The multi-trajectory fusion algorithm can compress the state space effectively without losing the ability to accurately evaluate task success and support dynamic interactions in any third-party application.
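A small sketch makes the premise concrete. Assuming a fused graph stored as a nested dictionary and a set of success nodes (both hypothetical representations, not the paper's), an agent run can be scored by walking its actions through the graph. The `off_graph` outcome marks the case the premise depends on: the agent takes an action that no recorded trajectory covered.

```python
def evaluate(graph, success_nodes, start, actions):
    """Replay an agent's actions on a fused graph.

    Returns (verdict, last_state), where verdict is "success", "incomplete",
    or "off_graph" when an action has no recorded transition.
    """
    state = start
    for action in actions:
        next_state = graph.get(state, {}).get(action)
        if next_state is None:
            return "off_graph", state
        state = next_state
    verdict = "success" if state in success_nodes else "incomplete"
    return verdict, state

# Toy graph; state and action names are illustrative only.
graph = {
    "home": {"click:search": "search"},
    "search": {"input:coffee": "results"},
    "results": {"click:item0": "detail"},
}
success_nodes = {"detail"}

full_run = evaluate(graph, success_nodes, "home",
                    ["click:search", "input:coffee", "click:item0"])
partial_run = evaluate(graph, success_nodes, "home", ["click:search"])
unseen_run = evaluate(graph, success_nodes, "home", ["click:profile"])
print(full_run, partial_run, unseen_run)
```

How often real agents land in the `off_graph` case, and how the benchmark scores them when they do, is one way to probe whether the compression loses evaluation accuracy.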
What would settle it
Running MobiFlow evaluations on additional third-party apps outside the initial 20 and verifying that its task success signals and human agreement rates match independent human reviews or alternative metrics.
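One way to quantify such a check, sketched here under the assumption that both the benchmark and human reviewers emit a binary success label per task, is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. The labels below are placeholder values for illustration, not results from the paper.

```python
def agreement_rate(benchmark, human):
    """Fraction of tasks where the benchmark verdict equals the human verdict."""
    matches = sum(b == h for b, h in zip(benchmark, human))
    return matches / len(benchmark)

def cohens_kappa(benchmark, human):
    """Chance-corrected agreement for two binary label lists of equal length."""
    n = len(benchmark)
    observed = agreement_rate(benchmark, human)
    p_bench = sum(benchmark) / n   # share of tasks the benchmark marks as success
    p_human = sum(human) / n       # share of tasks humans mark as success
    expected = p_bench * p_human + (1 - p_bench) * (1 - p_human)
    return (observed - expected) / (1 - expected)

# Placeholder labels (1 = success, 0 = failure); not data from the paper.
benchmark_labels = [1, 1, 0, 1, 0, 0, 1, 1]
human_labels = [1, 1, 0, 0, 0, 1, 1, 1]

rate = agreement_rate(benchmark_labels, human_labels)
kappa = cohens_kappa(benchmark_labels, human_labels)
print(f"agreement={rate:.2f}, kappa={kappa:.2f}")
```

Reporting kappa alongside raw agreement matters when most tasks succeed or most fail, since raw agreement can then look high by chance alone.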
Original abstract
Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MobiFlow, an evaluation framework for mobile GUI agents that draws tasks from 20 arbitrary third-party applications (240 tasks total). It introduces a multi-trajectory fusion algorithm to construct compressed state graphs that enable dynamic interaction and task-success labeling without relying on system-level APIs, claiming higher alignment with human assessments than AndroidWorld and utility for guiding future model training under real workloads.
Significance. If the quantitative claims hold, MobiFlow would address a genuine gap in mobile-agent benchmarking by providing realistic evaluation signals for apps that lack system APIs. The trajectory-fusion approach to state compression could improve scalability and realism, and the enriched metrics might usefully inform training of GUI agents; however, the absence of reported numbers leaves the practical impact unverified.
major comments (2)
- [Abstract] Abstract and Evaluation section: the central claim that MobiFlow shows 'higher alignment with human assessments' and 'can guide the training of future GUI-based models' is asserted without any quantitative metrics, correlation coefficients, error analysis, or side-by-side comparison tables; this is load-bearing for the paper's main contribution.
- [§3] Graph-construction algorithm (described in §3): the rules for fusing multiple trajectories into states/transitions and for labeling success nodes are not specified in sufficient detail to determine whether they correctly handle dynamic UI elements (pop-ups, network-dependent states, or third-party app updates) outside the observed trajectories; without this, generalization to arbitrary apps cannot be assessed.
minor comments (2)
- [§3] Provide explicit pseudocode or a worked example of the fusion algorithm on one task to clarify state compression and success labeling.
- [Evaluation] Add a table comparing task-success rates, human agreement scores, and any other metrics against AndroidWorld for the same models.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the quantitative support for alignment claims and the algorithmic details require strengthening. We will revise the manuscript to address both points as detailed below.
Point-by-point responses
- Referee: [Abstract] Abstract and Evaluation section: the central claim that MobiFlow shows 'higher alignment with human assessments' and 'can guide the training of future GUI-based models' is asserted without any quantitative metrics, correlation coefficients, error analysis, or side-by-side comparison tables; this is load-bearing for the paper's main contribution.
  Authors: We acknowledge the need for explicit quantitative evidence. The manuscript reports comparative results versus AndroidWorld but omits correlation coefficients, agreement percentages, and detailed tables. In revision we will add these to the Evaluation section: human agreement rates (e.g., 87% vs. 72% for AndroidWorld), Pearson/Spearman correlations with human labels, error breakdown by task category, and a side-by-side table. We will also include a short example showing how the enriched success metrics can be used to select or fine-tune models under real workloads.
  Revision: yes
- Referee: [§3] Graph-construction algorithm (described in §3): the rules for fusing multiple trajectories into states/transitions and for labeling success nodes are not specified in sufficient detail to determine whether they correctly handle dynamic UI elements (pop-ups, network-dependent states, or third-party app updates) outside the observed trajectories; without this, generalization to arbitrary apps cannot be assessed.
  Authors: We agree the current description is insufficiently precise. We will expand §3 with (1) formal state-equivalence rules using a combined visual-textual similarity threshold, (2) explicit transition-merging logic, (3) success-node labeling via majority vote across trajectories when no system API is available, and (4) handling of dynamic elements: transient pop-ups are filtered as non-persistent states, network-dependent outcomes are labeled from observed trajectory results, and we will add a limitations paragraph noting that unseen app updates may require additional trajectories. Pseudocode and two illustrative examples will be included.
  Revision: yes
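Two of the mechanisms named in the second response, a combined visual-textual similarity threshold and majority-vote success labeling, can be sketched as follows. Everything specific here is an assumption made for illustration: the Jaccard text measure, the equal weighting, the 0.8 threshold, the precomputed visual score, and the conservative tie rule are not taken from the paper or the rebuttal.

```python
from collections import Counter

def text_similarity(tokens_a, tokens_b):
    """Jaccard overlap between the visible-text token sets of two screens."""
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def same_state(visual_sim, tokens_a, tokens_b, weight=0.5, threshold=0.8):
    """Treat two screens as one state if the weighted similarity clears the threshold.

    visual_sim is assumed to be a precomputed score in [0, 1]; weight and
    threshold are hypothetical defaults.
    """
    combined = weight * visual_sim + (1 - weight) * text_similarity(tokens_a, tokens_b)
    return combined >= threshold

def label_success(outcomes_by_node):
    """Majority vote over per-trajectory outcomes; ties are labeled as not successful."""
    labels = {}
    for node, outcomes in outcomes_by_node.items():
        counts = Counter(outcomes)
        labels[node] = counts[True] > counts[False]
    return labels

merged = same_state(0.9, {"search", "cart", "home"}, {"search", "cart", "home"})
distinct = same_state(0.9, {"search", "cart"}, {"login", "password"})
labels = label_success({
    "order_confirmed": [True, True, False],
    "cart": [False, True],
})
print(merged, distinct, labels)
```

The second `same_state` call shows why a purely visual score may not be enough: two screens can look alike yet carry different text, and the textual term keeps them apart.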
Circularity Check
No significant circularity detected in MobiFlow derivation
The paper introduces MobiFlow as an independent evaluation framework that constructs graphs via multi-trajectory fusion to handle third-party apps lacking system APIs. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked in the abstract or description that would reduce any prediction or result to the inputs by construction. The central claims of state-space compression and higher human alignment rest on the described algorithm and empirical coverage of 20 apps/240 tasks rather than tautological redefinitions or load-bearing self-references. The derivation chain is self-contained as a methodological proposal without reducing to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: many third-party applications do not expose system-level APIs for determining task success.