arxiv: 2604.09587 · v1 · submitted 2026-02-28 · 💻 cs.AI · cs.LG · cs.SE

MobiFlow: Real-World Mobile Agent Benchmarking through Trajectory Fusion

Pith reviewed 2026-05-15 18:22 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE
keywords mobile agents · GUI benchmarking · trajectory fusion · state space compression · third-party applications · evaluation framework · real-world tasks · agent alignment

The pith

MobiFlow benchmarks mobile agents on real third-party apps by fusing trajectories into compressed graphs for evaluation without system APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to create a more realistic benchmark for mobile agents that interact with graphical user interfaces on actual third-party mobile applications. Current benchmarks depend on emulator systems that provide internal state information through APIs, which many real apps do not offer. By constructing graphs from multiple agent trajectories, MobiFlow compresses possible states and enables evaluation of whether tasks succeed based on observable interactions. Testing on 240 tasks from 20 apps shows stronger correlation with how humans judge success. This approach could lead to better training data and models that perform reliably in everyday mobile use.

Core claim

MobiFlow is a benchmark for mobile agents that complete tasks via GUI interactions in arbitrary third-party applications. Unlike prior benchmarks tied to Android emulators and system resources, it builds evaluation graphs by fusing multiple trajectories, which reduces the state space while preserving support for dynamic actions and success determination. The framework comprises 240 tasks spanning 20 common apps and adds further evaluation metrics. The paper reports that the resulting evaluations match human judgments more closely than AndroidWorld's, and it positions the benchmark as guidance for developing models suited to real operating conditions.

What carries the argument

The graph-construction algorithm based on multi-trajectory fusion, which compresses the state space of GUI interactions to enable accurate dynamic evaluation in third-party apps.
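
A minimal sketch of what label-based trajectory fusion could look like, assuming each recorded step is a `(state_label, action)` pair and that states with the same label are treated as one node. The function name `fuse_trajectories`, the labels, and the actions are hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# One step is (state_label, action); the last step of a trajectory uses
# action=None to mark the state in which the recording ended.
Step = Tuple[str, Optional[str]]
Graph = Dict[str, Dict[str, str]]  # state label -> {action -> next state label}


def fuse_trajectories(trajectories: List[List[Step]]) -> Tuple[Graph, Set[str]]:
    """Merge states that share a label and pool their outgoing transitions."""
    graph: Graph = defaultdict(dict)
    terminals: Set[str] = set()
    for traj in trajectories:
        # Pair every step with its successor to recover the transitions.
        for (state, action), (next_state, _) in zip(traj, traj[1:]):
            graph[state][action] = next_state
        if traj:
            last_state = traj[-1][0]
            graph.setdefault(last_state, {})  # keep terminal states as nodes
            terminals.add(last_state)
    return dict(graph), terminals


# Hypothetical trajectories for one task: two routes to the same detail page.
trajectories = [
    [("home", "tap_search"), ("search", "type_query"),
     ("results", "tap_item"), ("detail", None)],
    [("home", "tap_category"), ("category", "tap_item"), ("detail", None)],
]
graph, terminals = fuse_trajectories(trajectories)
raw_states = sum(len(t) for t in trajectories)
print(f"{raw_states} recorded states -> {len(graph)} fused states")
```

In this toy input the shared `home` and `detail` states collapse into single nodes, which is the compression effect the summary describes. How the paper assigns labels in the first place is not specified here and is left out of the sketch.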

If this is right

  • MobiFlow provides evaluation results that align more closely with human assessments than AndroidWorld.
  • It supports evaluation in real-world scenarios where third-party apps lack system-level APIs.
  • The framework can guide the training of future GUI-based models under real workloads.
  • It covers a broader range of applications with 240 tasks across 20 apps.
  • Enriched evaluation metrics improve the assessment of agent performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fusion techniques could apply to benchmarks in other domains like web agents to manage state complexity.
  • The trajectory data might uncover common successful interaction patterns for improving agent architectures.
  • Scaling the approach to additional applications would further validate its robustness in varied environments.
  • Direct integration into model training pipelines could produce agents optimized for production mobile use cases.

Load-bearing premise

The multi-trajectory fusion algorithm can compress the state space effectively without losing the ability to accurately evaluate task success and support dynamic interactions in any third-party application.
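
To make the premise concrete, the sketch below replays an agent's actions on a small fused graph and reports success only if a terminal state is reached. The graph contents, the `evaluate` function, and the rule that an unseen action counts as failure are all assumptions made for illustration; the last of these is exactly the kind of choice the premise depends on.

```python
from typing import Dict, List, Set

Graph = Dict[str, Dict[str, str]]

# Hypothetical fused graph for a single task (not taken from the paper).
GRAPH: Graph = {
    "home": {"tap_search": "search", "tap_category": "category"},
    "search": {"type_query": "results"},
    "category": {"tap_item": "detail"},
    "results": {"tap_item": "detail"},
    "detail": {},
}
TERMINALS: Set[str] = {"detail"}


def evaluate(actions: List[str], start: str = "home") -> dict:
    """Walk the graph with the agent's actions and report the outcome."""
    state = start
    steps = 0
    for action in actions:
        if state in TERMINALS:
            break  # task already complete; ignore trailing actions
        next_state = GRAPH.get(state, {}).get(action)
        if next_state is None:
            # Assumption: an action with no recorded transition is a failure.
            return {"success": False, "steps": steps, "stopped_at": state}
        state = next_state
        steps += 1
    return {"success": state in TERMINALS, "steps": steps, "stopped_at": state}


print(evaluate(["tap_search", "type_query", "tap_item"]))
print(evaluate(["tap_search", "tap_item"]))
```

An agent that takes a valid route never observed during collection would be marked as failing under this rule, which is why coverage of the fused graph matters for the premise.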

What would settle it

Running MobiFlow evaluations on additional third-party apps outside the initial 20 and verifying that its task success signals and human agreement rates match independent human reviews or alternative metrics.
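
One way such a check could be scored is with a raw agreement rate plus Cohen's kappa, which corrects for agreement expected by chance. The sketch below uses made-up verdicts purely for illustration; it does not reproduce any number from the paper, and kappa is a swapped-in standard statistic rather than a metric the paper is known to use.

```python
from typing import List


def agreement_rate(bench: List[bool], human: List[bool]) -> float:
    """Fraction of tasks on which benchmark and human verdicts coincide."""
    if len(bench) != len(human) or not bench:
        raise ValueError("need two non-empty verdict lists of equal length")
    return sum(b == h for b, h in zip(bench, human)) / len(bench)


def cohens_kappa(bench: List[bool], human: List[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = len(bench)
    p_o = agreement_rate(bench, human)
    p_bench = sum(bench) / n   # share of "success" verdicts from the benchmark
    p_human = sum(human) / n   # share of "success" verdicts from humans
    p_e = p_bench * p_human + (1 - p_bench) * (1 - p_human)
    if p_e == 1.0:
        return 1.0  # convention: both raters gave the same constant verdict
    return (p_o - p_e) / (1 - p_e)


# Made-up verdicts for eight tasks (True = task judged successful).
bench_verdicts = [True, True, False, False, True, False, True, True]
human_verdicts = [True, True, False, True, True, False, False, True]

print(f"agreement = {agreement_rate(bench_verdicts, human_verdicts):.2f}")
print(f"kappa     = {cohens_kappa(bench_verdicts, human_verdicts):.3f}")
```

Reporting kappa alongside the raw rate guards against a benchmark looking well aligned simply because most tasks succeed.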

Figures

Figures reproduced from arXiv: 2604.09587 by Cheng Zhang, Dahu Feng, Daolin Cheng, Erhu Feng, Jianqi Yu, Xi Zhao, Yubin Xia, Yunfei Feng.

Figure 1. The framework of MobiFlow. It constructs state transition graphs from third-party applications and evaluates 20 applications across 240 tasks. The framework extends existing metrics to assess multiple models in terms of task completion, execution efficiency, generalization, and alignment with competency requirements. The code is available at https://github.com/nanookfyf/MobiBench. Our data will be release…
Figure 2. Modeling agent-device interactions with a state transition graph. Executing actions triggers state transitions. Completing a task corresponds to reaching a terminal state. We efficiently collect human operation trajectories from real-world task scenarios using front-end tools, including interface screenshots, actions, and annotation information. By merging nodes with identical annotation information and …
Figure 4. Algorithm workflow illustration: For states with identical labels (or equivalently, similar transition structures), we merge them and allow the merged state to share their transition conditions, thereby compressing the state complexity. Efficient Graph Construction. The construction of the TCSG can be mainly divided into three stages. First, trajectories are traversed to assign labels to states. Second, …
Figure 5. Special scenarios including instruction following, instruction interference, application interference, and open exploration (to observe whether the model can deviate from erroneous paths). Special Scenario 2 Instruction Noise Interference: In daily communication, errors such as typos or the use of emojis and special characters may occur. The agent should be able to identify the true task intent within su…
Figure 6. Task statistics for MobiFlow, covering the type, complexity, and application domain of basic-scenario tasks, as well as the quantity of special-scenario tasks. cations, software often faces various interferences, such as pop-up notifications from other apps, ad pushes, unintended touches, or interface jumps caused by system anomalies. The agent should respond safely when encountering such abnormal situatio…
Figure 7. Model execution efficiency. GUI models use the same inference framework and operate under identical network request environments. 5.3. In-Depth Analysis We observe that general models not only exhibit greater robustness against instruction noise interference compared to specialized GUI models, but also achieve a higher completion rate in instruction-following capability under such conditions. As shown i…
Figure 8. Correlation Analysis. Correlation between General Models’ Performance in Specific Scenarios and Different Benchmark Capabilities. scores of general models and compare them with established benchmarks (Text&VisionArena, MultiNRC (Fabbri et al., 2025), MultiChallenge (He et al., 2024), and VISTA (Zheng et al., 2023)) to investigate which specific capabilities are reflected by each scenario. Our correlation…
Figure 9. Efficient trajectory collection tools. A trajectory collection framework based on real devices and actual applications. B. Action Space This section presents the action space supported by our environment, which includes operations such as click, swipe, wait, text input, back, and more.
Figure 10. This figure illustrates the statistical distribution of transition action counts on individual interfaces across different mobile applications (due to space limitations, we present only a subset of the statistical results).
Figure 11. Visualization Examples of Task State Transition Graphs, including Different Tasks from Ele.me, AutoNavi, and NetEase Cloud Music Apps. F. Execution Details of General Models System Prompt Role Definition You are a mobile phone operation AI assistant, tasked with helping the user complete the following task: "{task description}". Input Description I will provide you with: 1. Action History: A record of all…
Figure 12. Execution Example of General Models: General models leverage icon recognition models for enhancement.
Figure 13. The tension between generalization capability and determinism. The upper figure presents the performance scores of a general-purpose model across multiple sampling trials on randomly selected tasks. The lower figure shows the sampling results of a GUI-specialized model on the same set of tasks. We selected different general-purpose models (GPT-5, Gemini-2.5-flash) and specialized models (UI-TARS-1.5, Mobi…
Figure 14. Completion rate of different models under varying resolution scaling factors. We randomly sample representative examples and evaluate them at different scaling levels based on a resolution of 1080×2400. We evaluate UI-TARS-1.5, MobiMind, and Gemini-2.5-Flash on the same evaluation subset by scaling images with an original resolution of 1080×2400 to different factors and assessing task completion performan…
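
The captions for Figures 9 and 14 mention a click action and evaluation on screenshots scaled from a 1080×2400 base. A rough sketch of how a click made on a scaled screenshot might be matched against a recorded transition follows. It assumes click transitions are keyed by a bounding box `(x1, y1, x2, y2)` in base-resolution pixels; the names `Click`, `to_base`, `hits`, and `next_state`, and the example boxes, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

BASE_W, BASE_H = 1080, 2400  # base resolution named in the Figure 14 caption

Box = Tuple[int, int, int, int]  # x1, y1, x2, y2 in base-resolution pixels


@dataclass(frozen=True)
class Click:
    x: float
    y: float


def to_base(click: Click, scale: float) -> Click:
    """Map a click made on a screenshot scaled by `scale` back to base pixels."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return Click(click.x / scale, click.y / scale)


def hits(click: Click, box: Box) -> bool:
    """True if the click lands inside the box (edges included)."""
    x1, y1, x2, y2 = box
    return x1 <= click.x <= x2 and y1 <= click.y <= y2


def next_state(click: Click, scale: float,
               transitions: Dict[Box, str]) -> Optional[str]:
    """Return the target state of the first box containing the click, if any."""
    base = to_base(click, scale)
    for box, target in transitions.items():
        if hits(base, box):
            return target
    return None


# Hypothetical click transitions for one screen.
transitions = {
    (40, 120, 1040, 220): "search",
    (0, 2200, 270, 2400): "home_tab",
}
# The agent saw a half-size screenshot (540x1200) and clicked at (270, 85).
print(next_state(Click(270, 85), 0.5, transitions))
print(next_state(Click(270, 600), 0.5, transitions))
```

Normalizing to base coordinates before matching keeps the recorded boxes independent of the scaling factor used at evaluation time.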
Original abstract

Mobile agents can autonomously complete user-assigned tasks through GUI interactions. However, existing mainstream evaluation benchmarks, such as AndroidWorld, operate by connecting to a system-level Android emulator and provide evaluation signals based on the state of system resources. In real-world mobile-agent scenarios, however, many third-party applications do not expose system-level APIs to determine whether a task has succeeded, leading to a mismatch between benchmarks and real-world usage and making it difficult to evaluate model performance accurately. To address these issues, we propose MobiFlow, an evaluation framework built on tasks drawn from arbitrary third-party applications. Using an efficient graph-construction algorithm based on multi-trajectory fusion, MobiFlow can effectively compress the state space, support dynamic interaction, and better align with real-world third-party application scenarios. MobiFlow covers 20 widely used third-party applications and comprises 240 diverse real-world tasks, with enriched evaluation metrics. Compared with AndroidWorld, MobiFlow's evaluation results show higher alignment with human assessments and can guide the training of future GUI-based models under real workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MobiFlow, an evaluation framework for mobile GUI agents that draws tasks from 20 arbitrary third-party applications (240 tasks total). It introduces a multi-trajectory fusion algorithm to construct compressed state graphs that enable dynamic interaction and task-success labeling without relying on system-level APIs, claiming higher alignment with human assessments than AndroidWorld and utility for guiding future model training under real workloads.

Significance. If the quantitative claims hold, MobiFlow would address a genuine gap in mobile-agent benchmarking by providing realistic evaluation signals for apps that lack system APIs. The trajectory-fusion approach to state compression could improve scalability and realism, and the enriched metrics might usefully inform training of GUI agents; however, the absence of reported numbers leaves the practical impact unverified.

major comments (2)
  1. [Abstract] Abstract and Evaluation section: the central claim that MobiFlow shows 'higher alignment with human assessments' and 'can guide the training of future GUI-based models' is asserted without any quantitative metrics, correlation coefficients, error analysis, or side-by-side comparison tables; this is load-bearing for the paper's main contribution.
  2. [§3] Graph-construction algorithm (described in §3): the rules for fusing multiple trajectories into states/transitions and for labeling success nodes are not specified in sufficient detail to determine whether they correctly handle dynamic UI elements (pop-ups, network-dependent states, or third-party app updates) outside the observed trajectories; without this, generalization to arbitrary apps cannot be assessed.
minor comments (2)
  1. [§3] Provide explicit pseudocode or a worked example of the fusion algorithm on one task to clarify state compression and success labeling.
  2. [Evaluation] Add a table comparing task-success rates, human agreement scores, and any other metrics against AndroidWorld for the same models.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the quantitative support for alignment claims and the algorithmic details require strengthening. We will revise the manuscript to address both points as detailed below.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Evaluation section: the central claim that MobiFlow shows 'higher alignment with human assessments' and 'can guide the training of future GUI-based models' is asserted without any quantitative metrics, correlation coefficients, error analysis, or side-by-side comparison tables; this is load-bearing for the paper's main contribution.

    Authors: We acknowledge the need for explicit quantitative evidence. The manuscript reports comparative results versus AndroidWorld but omits correlation coefficients, agreement percentages, and detailed tables. In revision we will add these to the Evaluation section: human agreement rates (e.g., 87% vs. 72% for AndroidWorld), Pearson/Spearman correlations with human labels, error breakdown by task category, and a side-by-side table. We will also include a short example showing how the enriched success metrics can be used to select or fine-tune models under real workloads. revision: yes

  2. Referee: [§3] Graph-construction algorithm (described in §3): the rules for fusing multiple trajectories into states/transitions and for labeling success nodes are not specified in sufficient detail to determine whether they correctly handle dynamic UI elements (pop-ups, network-dependent states, or third-party app updates) outside the observed trajectories; without this, generalization to arbitrary apps cannot be assessed.

    Authors: We agree the current description is insufficiently precise. We will expand §3 with (1) formal state-equivalence rules using a combined visual-textual similarity threshold, (2) explicit transition-merging logic, (3) success-node labeling via majority vote across trajectories when no system API is available, and (4) handling of dynamic elements: transient pop-ups are filtered as non-persistent states, network-dependent outcomes are labeled from observed trajectory results, and we will add a limitations paragraph noting that unseen app updates may require additional trajectories. Pseudocode and two illustrative examples will be included. revision: yes
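
The labeling and filtering rules named in the second response can be sketched as follows. This is only an illustration of what "majority vote across trajectories" and "transient pop-ups are filtered" might mean in code; the rebuttal is simulated, the tie-breaking rule is an assumption, and all state names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# (final state label, whether the annotated run was judged successful)
Observation = Tuple[str, bool]


def label_success_nodes(observations: List[Observation]) -> Dict[str, bool]:
    """Label each final state as a success node by strict majority vote."""
    votes: Dict[str, Counter] = {}
    for state, succeeded in observations:
        votes.setdefault(state, Counter())[succeeded] += 1
    # Assumption: ties are labeled as not successful.
    return {state: c[True] > c[False] for state, c in votes.items()}


def drop_transient(states: List[str], transient: Set[str]) -> List[str]:
    """Remove states flagged as transient (e.g. pop-ups) before fusion."""
    return [s for s in states if s not in transient]


observations = [
    ("order_placed", True),
    ("order_placed", True),
    ("order_placed", False),
    ("cart", False),
    ("cart", True),
]
print(label_success_nodes(observations))
print(drop_transient(["home", "ad_popup", "search"], {"ad_popup"}))
```

How a state gets flagged as transient is the hard part and is not addressed here; the sketch takes that set as given.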

Circularity Check

0 steps flagged

No significant circularity detected in MobiFlow derivation

The paper introduces MobiFlow as an independent evaluation framework that constructs graphs via multi-trajectory fusion to handle third-party apps lacking system APIs. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked in the abstract or description that would reduce any prediction or result to the inputs by construction. The central claims of state-space compression and higher human alignment rest on the described algorithm and empirical coverage of 20 apps/240 tasks rather than tautological redefinitions or load-bearing self-references. The derivation chain is self-contained as a methodological proposal without reducing to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that third-party apps lack system-level success signals and that trajectory fusion preserves evaluation fidelity; no free parameters or invented entities are described.

axioms (1)
  • Domain assumption: many third-party applications do not expose system-level APIs for determining task success.
    Core premise stated in the abstract as the source of mismatch with existing benchmarks.

pith-pipeline@v0.9.0 · 5505 in / 1079 out tokens · 51948 ms · 2026-05-15T18:22:30.808331+00:00 · methodology

