GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Jitao Sang; Xiaoyi Chen; Xingxing Song; Yang Xu; Yifei Gao; Yi Zhang

arxiv: 2605.29532 · v1 · pith:EPCUDCVQnew · submitted 2026-05-28 · 💻 cs.SE · cs.AI

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Xiaoyi Chen , Yifei Gao , Yang Xu , Xingxing Song , Yi Zhang , Jitao Sang This is my paper

Pith reviewed 2026-06-29 06:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords GUI testingexploratory testingMLLM agentsopen-set evaluationdefect detectionAndroid applicationsprocess-aware evaluationbenchmark

0 comments

The pith

GUIJudge decomposes exploratory GUI testing trajectories into diagnosable capabilities for open-set evaluation that outperforms annotation-bound methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GUITestScape as an interactive benchmark spanning 61 real Android applications and 508 preset defects of both interaction and display types. It introduces GUIJudge as an evaluator that breaks down an agent's full testing process rather than judging only the final state against fixed labels. This setup matters because prior benchmarks ignore display defects and collapse distinct failure modes into one score. Experiments indicate GUIJudge delivers more reliable process-aware judgments than baselines and that its verifiers can be added to existing agents to raise detection rates without any retraining. The work positions detection itself as the persistent weak point across models on both defect categories.

Core claim

GUIJudge achieves reliable process-aware evaluation beyond predefined annotations, substantially outperforming all baselines. Benchmarking on GUITestScape reveals that detection remains the critical bottleneck for existing models across both defect types, and that integrating GUIJudge's verifiers into existing agents significantly boosts their detection performance without retraining.

What carries the argument

GUIJudge, an open-set evaluator that decomposes an agent's testing trajectory into independently diagnosable capabilities

If this is right

Detection is the primary performance limiter for current models on both interaction and display defects.
Adding GUIJudge verifiers raises detection performance of existing agents without any retraining.
Evaluation can separate qualitatively different failure modes instead of reducing testing to a single end-state score.
Benchmarks must include display defects alongside interaction defects to reflect realistic exploratory testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar trajectory decomposition could apply to other agent evaluation settings where intermediate steps matter more than final outcomes.
Future work might test whether the same verifiers improve agents on entirely new application domains not seen in the 61-app set.
Agents could be trained to internalize these capability checks rather than relying on post-hoc addition.

Load-bearing premise

The 508 preset defects across the 61 apps and the capability decomposition performed by GUIJudge are representative of real exploratory testing behavior and do not introduce selection bias that would make the reported gains specific to this benchmark.

What would settle it

A new collection of apps and defects outside the 508 presets on which GUIJudge's advantage shrinks or disappears, or on which adding its verifiers fails to improve agent detection rates.

Figures

Figures reproduced from arXiv: 2605.29532 by Jitao Sang, Xiaoyi Chen, Xingxing Song, Yang Xu, Yifei Gao, Yi Zhang.

**Figure 2.** Figure 2: Construction pipeline of GUITestScape. Each real-world defect is instantiated into an evaluation case [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Hierarchical distribution of preset defects [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: GUIJudge Workflow. Given an agent trajec [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Overall Recall and F1 trends across Reach, Trigger, and Detect for general models and agent models on [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Example of element layout defect. The task requires opening the settings page via the navigation drawer. [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Example of content rendering defect. The task requires opening the settings page and then tapping [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Example of operation no response defect. The task requires opening the search field, typing “Exercise”, [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Example of navigation logic error defect. The task requires inspecting the [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

**Figure 10.** Figure 10: Example of unexpected task result defect. The task requires switching the [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗

read the original abstract

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously navigate an application and discover defects through its own interaction. However, current evaluation falls short on two fronts. First, existing benchmarks focus almost exclusively on interaction defects, leaving display defects outside the evaluation frame. Second, evaluation protocols are bound to predefined defect annotations, collapsing the testing process into a single end-state judgment that conflates qualitatively distinct failure modes. To address these challenges, we present GUITestScape, an interactive benchmark covering 61 real-world Android applications and 508 preset defects spanning interaction and display types, and introduce GUIJudge, an open-set evaluator that decomposes an agent's testing trajectory into independently diagnosable capabilities. Experimental results demonstrate that GUIJudge achieves reliable process-aware evaluation beyond predefined annotations, substantially outperforming all baselines. Benchmarking on GUITestScape further reveals that detection remains the critical bottleneck for existing models across both defect types, and that integrating GUIJudge's verifiers into existing agents significantly boosts their detection performance without retraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GUITestScape and GUIJudge add a benchmark with display defects and a process-decomposing evaluator, but preset defects leave generalizability unclear.

read the letter

The main thing to know is that the paper introduces GUITestScape, a benchmark of 61 Android apps with 508 preset defects that now includes display defects alongside interaction ones, and GUIJudge, an evaluator that splits an agent's trajectory into separate capabilities for open-set judgment instead of relying only on end-state annotations.

This fills a practical gap. Exploratory GUI testing with MLLM agents needs evaluation that handles both defect types and distinguishes failure modes during the process, and the work supplies concrete artifacts for that. The reported results show GUIJudge beating baselines on process-aware scoring and that adding its verifiers lifts detection performance in existing agents without retraining. Detection is called out as the persistent weak point across models.

The soft spot is the representativeness of the 508 defects and the capability breakdown. If those defects were chosen or the decomposition shaped to suit the proposed verifiers, the outperformance claims and the bottleneck conclusion could be tied to this specific setup rather than reflecting broader exploratory testing. The abstract gives no details on selection criteria or external checks for real-world match, so that assumption carries the central claims.

This is for people building or evaluating GUI agents in software engineering. Readers who care about benchmark design for autonomous testing would find the new artifacts worth examining. It deserves peer review because the artifacts are substantial enough to get feedback on, even if the experiments need more on how the defects were picked and whether the gains hold elsewhere.

Referee Report

2 major / 0 minor

Summary. The paper introduces GUITestScape, an interactive benchmark spanning 61 real-world Android applications and 508 preset defects of both interaction and display types, along with GUIJudge, an open-set evaluator that decomposes agent trajectories into independently diagnosable capabilities. It claims that GUIJudge enables reliable process-aware evaluation beyond predefined annotations and substantially outperforms baselines; benchmarking further shows detection as the critical bottleneck across defect types and that integrating GUIJudge verifiers boosts existing agents' detection performance without retraining.

Significance. If the experimental claims hold under rigorous validation, the work would advance evaluation methodology for MLLM agents in exploratory GUI testing by expanding beyond interaction-only and annotation-bound protocols, while offering a practical no-retraining improvement path. The identification of detection as bottleneck could usefully direct future agent development.

major comments (2)

[Abstract / §3] Abstract and benchmark construction (likely §3): the central claims of reliable open-set evaluation, outperformance, detection bottleneck, and no-retraining gains rest on the assumption that the 508 preset defects and GUIJudge's capability decomposition are representative of real exploratory testing behavior; no independent validation, inter-rater agreement, or comparison to naturally occurring defects is described, leaving open the possibility that reported gains are benchmark-specific.
[Abstract / §4] Experimental results (likely §4 and §5): the abstract asserts experimental superiority and bottleneck findings but the provided text supplies no information on experimental setup details, choice of baselines, statistical tests, variance across runs, or how the 508 defects were sampled/verified, preventing assessment of whether the support for the claims is adequate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional rigor and clarity would strengthen the manuscript. We respond to each major comment below, indicating where revisions will be made.

read point-by-point responses

Referee: [Abstract / §3] Abstract and benchmark construction (likely §3): the central claims of reliable open-set evaluation, outperformance, detection bottleneck, and no-retraining gains rest on the assumption that the 508 preset defects and GUIJudge's capability decomposition are representative of real exploratory testing behavior; no independent validation, inter-rater agreement, or comparison to naturally occurring defects is described, leaving open the possibility that reported gains are benchmark-specific.

Authors: We agree that explicit validation of representativeness would strengthen the claims. Section 3 describes the curation of the 508 defects from 61 real-world Android apps, drawing on documented common defect patterns for both interaction and display types. However, no formal inter-rater agreement study or direct comparison against a corpus of naturally occurring defects was performed in the submitted version. In revision we will add a subsection detailing the multi-author verification process used during defect creation and will include a limited post-hoc comparison against a small set of real user-reported defects where feasible. The open-set design of GUIJudge is intended to mitigate benchmark-specificity by evaluating capability decomposition independently of the presets; the reported outperformance is measured on held-out trajectories within GUITestScape. revision: partial
Referee: [Abstract / §4] Experimental results (likely §4 and §5): the abstract asserts experimental superiority and bottleneck findings but the provided text supplies no information on experimental setup details, choice of baselines, statistical tests, variance across runs, or how the 508 defects were sampled/verified, preventing assessment of whether the support for the claims is adequate.

Authors: The reviewed version omitted explicit experimental details. The complete manuscript in Sections 4 and 5 specifies the baselines (standard MLLM agents plus rule-based and annotation-bound evaluators), reports results across multiple independent runs with variance, applies statistical significance testing, and states that all 508 defects were exhaustively included rather than sampled. We will expand the experimental setup subsection to present these elements clearly and comprehensively in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new benchmark (GUITestScape with 61 apps and 508 preset defects) and an evaluator (GUIJudge) that decomposes trajectories into capabilities. No equations, fitted parameters, predictions, or self-citations appear in the provided text that reduce any central claim to its own inputs by construction. Claims of outperformance and detection bottlenecks rest on experimental comparisons to baselines on the introduced benchmark, which is a standard non-circular approach for new evaluation frameworks. The representativeness concern is an external validity issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, parameters, or modeling choices that can be audited for free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5731 in / 1044 out tokens · 27430 ms · 2026-06-29T06:46:58.535828+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 3 internal anchors

[1]

Mind2Web: Towards a Generalist Agent for the Web

Mind2web: Towards a generalist agent for the web.Preprint, arXiv:2306.06070. Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, and Jitao Sang. 2026a. Guitester: Enabling gui agents for exploratory defect discovery.Preprint, arXiv:2601.04500. Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, and...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Ui-tars: Pioneering automated gui interac- tion with native agents, 2025.URL https://arxiv. org/abs/2501.12326. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice 9 Li, William Bishop, Wei Li, Folawiyo Campbell- Ajala, Daniel Toyama, Robert Berry, Divya Tya- magundlu, Timothy Lillicrap, and Oriana Riva

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Androidworld: A dynamic benchmarking environment for autonomous agents.Preprint, arXiv:2405.14573. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell- Ajala, and 1 others. Androidworld: A dynamic benchmarking environment for autonomous agents, 2024.URL ht...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

- Historical content serves as context only; old issues from previous steps must not be directly treated as defects of the current step

Judge only thecurrent step. - Historical content serves as context only; old issues from previous steps must not be directly treated as defects of the current step. - A defect can be attributed to the current step only when this step clearly caused, propagated and exposed, or reconfirmed the issue
[5]

Do not classify agent mistakes as app defects. - Clicking the wrong target, hit=false, input that does not satisfy the task requirements, missing required steps, misunderstanding the task, or never actually reaching the intended control is typicallynotan app interaction defect. - A defect can be assigned only when evidence shows thateven though the curren...
[6]

has_defect

Decide strictly based on evidence. - Use only the provided textual trace and the current step’spre/postscreenshots. - Do not speculate about hidden flows, background states, undisplayed toasts, or network results that were not provided. Step Number Rules - If a defect is identified,stepmust be the step number of the current step under verification. - If n...

[1] [1]

Mind2Web: Towards a Generalist Agent for the Web

Mind2web: Towards a generalist agent for the web.Preprint, arXiv:2306.06070. Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, and Jitao Sang. 2026a. Guitester: Enabling gui agents for exploratory defect discovery.Preprint, arXiv:2601.04500. Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, and...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Ui-tars: Pioneering automated gui interac- tion with native agents, 2025.URL https://arxiv. org/abs/2501.12326. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice 9 Li, William Bishop, Wei Li, Folawiyo Campbell- Ajala, Daniel Toyama, Robert Berry, Divya Tya- magundlu, Timothy Lillicrap, and Oriana Riva

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Androidworld: A dynamic benchmarking environment for autonomous agents.Preprint, arXiv:2405.14573. Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell- Ajala, and 1 others. Androidworld: A dynamic benchmarking environment for autonomous agents, 2024.URL ht...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

- Historical content serves as context only; old issues from previous steps must not be directly treated as defects of the current step

Judge only thecurrent step. - Historical content serves as context only; old issues from previous steps must not be directly treated as defects of the current step. - A defect can be attributed to the current step only when this step clearly caused, propagated and exposed, or reconfirmed the issue

[5] [5]

Do not classify agent mistakes as app defects. - Clicking the wrong target, hit=false, input that does not satisfy the task requirements, missing required steps, misunderstanding the task, or never actually reaching the intended control is typicallynotan app interaction defect. - A defect can be assigned only when evidence shows thateven though the curren...

[6] [6]

has_defect

Decide strictly based on evidence. - Use only the provided textual trace and the current step’spre/postscreenshots. - Do not speculate about hidden flows, background states, undisplayed toasts, or network results that were not provided. Step Number Rules - If a defect is identified,stepmust be the step number of the current step under verification. - If n...