VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics
Pith reviewed 2026-05-16 06:13 UTC · model grok-4.3
The pith
VenusBench-Mobile shows state-of-the-art mobile GUI agents suffer large performance gaps and remain far from reliable real-world deployment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VenusBench-Mobile builds two evaluation pillars: user-intent-driven task design that mirrors real mobile usage and a capability-oriented annotation scheme that supports fine-grained analysis of agent actions. When applied to leading mobile GUI agents, the benchmark reveals substantially lower performance than earlier tests, with failures dominated by perception and memory deficiencies; even top agents achieve near-zero success under environment variations, showing they are not yet ready for reliable real-world use.
What carries the argument
Two core pillars carry the argument: user-intent-driven task design, which selects tasks from real usage patterns, and a capability-oriented annotation scheme, which tags outcomes by specific skills such as perception and memory to enable a diagnostic breakdown.
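As a rough illustration of what capability-level tagging makes possible, the sketch below tallies failure rates per capability tag. It is not the benchmark's actual tooling: the function name, the record layout, and the `planning` tag are assumptions made for illustration, and the toy records are invented rather than taken from the paper.

```python
from collections import defaultdict


def failure_rate_by_capability(results):
    """Return {capability_tag: failure_rate}.

    `results` is an iterable of (capability_tags, succeeded) pairs, where
    capability_tags is a tuple of strings. This layout is hypothetical.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for tags, succeeded in results:
        for tag in tags:
            totals[tag] += 1
            if not succeeded:
                failures[tag] += 1
    return {tag: failures[tag] / totals[tag] for tag in totals}


# Invented toy records, NOT data from the paper.
toy_results = [
    (("perception",), False),
    (("perception", "memory"), False),
    (("memory",), True),
    (("planning",), True),  # "planning" is a hypothetical tag
]
rates = failure_rate_by_capability(toy_results)
```

A task tagged with several capabilities counts toward each of them, so a single failure can raise more than one rate. Whether the paper attributes failures this way is not stated in the summary above.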
If this is right
- Current agents exhibit large performance gaps and are far from reliable real-world deployment.
- Failures are dominated by deficiencies in perception and memory that coarse evaluations hide.
- Even the strongest agents reach near-zero success under environment variations.
- The benchmark supplies a stepping stone for developing agents that handle realistic conditions.
Where Pith is reading between the lines
- Advancing agent perception modules and memory mechanisms could directly address the main failure sources identified.
- The same user-intent and capability-diagnostic approach could be applied to create benchmarks for web or desktop GUI agents.
- Capability-level diagnostics may help researchers prioritize fixes for specific weaknesses rather than overall task success.
Load-bearing premise
The chosen tasks and annotations accurately reflect the full range of real-world mobile usage without adding their own selection biases.
What would settle it
The claim would be undercut if leading agents achieved success rates on VenusBench-Mobile that match or exceed their rates on prior benchmarks, or if controlled tests showed that perception and memory are not the primary failure modes.
Original abstract
Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VenusBench-Mobile, an online benchmark for mobile GUI agents that uses user-intent-driven task design to capture real-world diversity and instability, paired with a capability-oriented annotation scheme for fine-grained diagnostics. Evaluations of state-of-the-art agents show large performance gaps versus prior benchmarks, with failures dominated by perception and memory deficiencies and near-zero success under environment variations; code and data are released.
Significance. If the realism claims hold, the work is significant for exposing limitations of current mobile GUI agents more effectively than app-centric benchmarks and for providing diagnostic tools to guide improvements. The open release of code and data is a clear strength that supports reproducibility and community use in HCI and agent research.
major comments (2)
- [Benchmark construction / task design] The user-intent-driven task design section asserts that selected tasks reflect real mobile usage diversity without providing quantitative validation (e.g., distribution comparisons or statistical tests against external usage corpora). This assumption is load-bearing for the central claim that observed performance gaps and brittleness demonstrate inherent agent limitations rather than curation effects.
- [Evaluation and diagnostics] The capability-oriented annotation scheme is used to diagnose perception and memory failures, but the manuscript lacks details on the annotation protocol, guidelines, or inter-annotator agreement metrics. This undermines the reliability of the fine-grained diagnostic findings reported in the evaluation section.
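One simple form the distribution comparison requested in the first major comment could take is the total variation distance between the benchmark's task-category frequencies and those of an external usage corpus. The following is a minimal sketch under that assumption; the function name, category names, and counts are invented placeholders, not figures from the paper or from any real corpus.

```python
def total_variation_distance(counts_a, counts_b):
    """Total variation distance between two categorical count tables.

    Returns a value in [0, 1]: 0 means identical category proportions,
    1 means the two distributions share no category.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both count tables must be non-empty")
    categories = set(counts_a) | set(counts_b)
    return 0.5 * sum(
        abs(counts_a.get(c, 0) / total_a - counts_b.get(c, 0) / total_b)
        for c in categories
    )


# Placeholder counts for illustration only.
benchmark_tasks = {"messaging": 30, "shopping": 20, "navigation": 50}
external_usage = {"messaging": 50, "shopping": 20, "navigation": 30}
tvd = total_variation_distance(benchmark_tasks, external_usage)
```

A small distance would support the representativeness premise only to the extent that the external corpus itself reflects real usage, so the choice of reference corpus remains a judgment call.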
minor comments (2)
- [Abstract] The abstract would benefit from including concrete numbers (e.g., total tasks, agents evaluated, or average task length) to immediately contextualize the scale of the benchmark.
- [Results tables] Tables reporting success rates should include statistical significance tests or confidence intervals for the claimed performance gaps relative to prior benchmarks.
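For the confidence intervals mentioned in the second minor comment, one option that behaves sensibly at near-zero success rates is the Wilson score interval; the plain normal approximation collapses to zero width when no task succeeds. The sketch below is an assumption about how such intervals might be computed, not the authors' method, and the trial counts are made up.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: an agent that solves 0 of 50 perturbed tasks.
low, high = wilson_interval(0, 50)
```

With zero successes the lower bound sits at zero while the upper bound stays strictly positive, which makes "near-zero success" claims easier to compare across agents evaluated on different numbers of tasks.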
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We appreciate the referee's emphasis on strengthening the validation of task design and the transparency of the annotation process. We address each major comment below and will revise the manuscript accordingly to improve clarity and rigor.
- Referee: [Benchmark construction / task design] The user-intent-driven task design section asserts that selected tasks reflect real mobile usage diversity without providing quantitative validation (e.g., distribution comparisons or statistical tests against external usage corpora). This assumption is load-bearing for the central claim that observed performance gaps and brittleness demonstrate inherent agent limitations rather than curation effects.
  Authors: We agree that quantitative validation would further support the claim of real-world representativeness. In the revised manuscript, we will add a new subsection detailing how tasks were derived from real mobile usage patterns (drawing from public app usage reports and HCI literature) and include distribution comparisons (e.g., category frequencies) against external corpora such as those from Statista or academic mobile usage studies, along with basic statistical measures to quantify alignment.
  Revision: yes
- Referee: [Evaluation and diagnostics] The capability-oriented annotation scheme is used to diagnose perception and memory failures, but the manuscript lacks details on the annotation protocol, guidelines, or inter-annotator agreement metrics. This undermines the reliability of the fine-grained diagnostic findings reported in the evaluation section.
  Authors: We acknowledge that additional details are needed for reproducibility and credibility. In the revision, we will expand the relevant section to describe the full annotation protocol, provide the complete guidelines given to annotators, and report inter-annotator agreement metrics (e.g., Cohen's kappa and raw agreement percentages) computed on a subset of tasks.
  Revision: yes
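The inter-annotator agreement metrics named in the second response can be made concrete with a short sketch. It computes raw agreement and Cohen's kappa for two annotators who each assign one failure label per task; the function name and the toy labels are illustrative assumptions, not the authors' annotation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return (raw_agreement, kappa) for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Invented toy labels for four failed tasks.
annotator_1 = ["perception", "perception", "memory", "memory"]
annotator_2 = ["perception", "memory", "memory", "memory"]
raw_agreement, kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case raw agreement is 0.75 while kappa is 0.5, which illustrates why reporting both is useful: kappa discounts the agreement that two annotators would reach by chance given how often each uses every label.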
Circularity Check
No significant circularity in empirical benchmark construction
This is an empirical benchmark paper introducing VenusBench-Mobile via user-intent-driven task design and capability-oriented annotations. The central claims rest on external evaluations of existing agents showing performance gaps, brittleness under variations, and failure modes in perception/memory; these observations are not derived from internal equations or self-referential definitions. No fitted parameters are called predictions, no uniqueness theorems are imported via self-citation, and no ansatz or renaming reduces results to inputs by construction. The assumption that the design reflects real usage is presented as a premise supported by the released code/data and observed results rather than a closed loop. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: user-intent-driven tasks reflect the diversity and instability of real-world mobile usage.