WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements

Duc-Minh Nguyen; Jin Song Dong; Ruofei Ren; Wenjie Zhang; Xiwen Teoh; Yun Lin

arxiv: 2602.11724 · v3 · submitted 2026-02-12 · 💻 cs.SE

Xiwen Teoh, Yun Lin, Duc-Minh Nguyen, Ruofei Ren, Wenjie Zhang, Jin Song Dong

Pith reviewed 2026-05-16 05:59 UTC · model grok-4.3

classification 💻 cs.SE

keywords web testing, LLM agents, natural language specifications, oracle inference, GUI symbolization, end-to-end testing, bug detection, hallucination mitigation

The pith

WebTestPilot uses a symbolization layer on GUI elements to infer pre- and post-condition oracles that let LLM agents test web apps against natural language specifications while separating hallucinations from real bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WebTestPilot as an LLM-based agent for end-to-end web testing from natural language requirements. It adds a symbolization layer that turns critical GUI elements into variables and then derives pre- and post-conditions for each test step to serve as an implicit oracle. These conditions capture data, temporal, and causal dependencies across steps so the agent can validate requirements that would otherwise be missed by simple navigation checks or isolated state verification. The method is evaluated on a new benchmark of web applications with injected bugs, where it reaches 99 percent task completion and 96 percent precision and recall in bug detection. The results hold across varied natural language inputs and different model sizes.

Core claim

WebTestPilot is an LLM-based agent for end-to-end web testing that first detects and symbolizes critical GUI elements into variables and then translates a natural language specification into a sequence of steps, each equipped with inferred pre- and post-conditions over those symbols. These oracles capture dependencies that allow the agent to act as its own validator and distinguish inconsistencies caused by model hallucinations from genuine application bugs. Existing approaches either accept any crash-free navigation or examine states in isolation and therefore miss context-dependent failures.

What carries the argument

Symbolization layer that converts critical GUI elements into variables, paired with inference of pre- and post-conditions over those variables to form per-step oracles.
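
To make the per-step oracle concrete, here is a minimal sketch of the check-act-check loop, assuming symbols are held in a plain dictionary. The names `Step`, `run_step`, `cart_count`, and `on_product_page` are hypothetical and do not come from the paper, and the toy "application" is a pair of Python functions standing in for a real browser action.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical representation: symbol name -> value extracted from the GUI.
Symbols = Dict[str, Any]


@dataclass
class Step:
    name: str
    precondition: Callable[[Symbols], bool]
    action: Callable[[Symbols], Symbols]               # returns symbols observed after acting
    postcondition: Callable[[Symbols, Symbols], bool]  # receives (before, after)


def run_step(step: Step, before: Symbols) -> str:
    """Check the precondition, perform the action, then check the postcondition."""
    if not step.precondition(before):
        return "precondition_failed"
    after = step.action(before)
    if not step.postcondition(before, after):
        return "postcondition_failed"
    return "passed"


def correct_add(symbols: Symbols) -> Symbols:
    # Toy stand-in for a working "add to cart" button.
    return {**symbols, "cart_count": symbols["cart_count"] + 1}


def buggy_add(symbols: Symbols) -> Symbols:
    # Toy stand-in for an injected bug: the cart count never changes.
    return dict(symbols)


def make_step(action: Callable[[Symbols], Symbols]) -> Step:
    return Step(
        name="add product to cart",
        precondition=lambda s: s["on_product_page"],
        action=action,
        postcondition=lambda b, a: a["cart_count"] == b["cart_count"] + 1,
    )


start = {"on_product_page": True, "cart_count": 0}
print(run_step(make_step(correct_add), start))  # passed
print(run_step(make_step(buggy_add), start))    # postcondition_failed
```

The postcondition takes both the before and after symbols, which is what lets a single step express a dependency on earlier state rather than judging the new page alone.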

If this is right

LLM agents can now perform reliable end-to-end testing against natural language specifications without needing manually written oracles.
Context-dependent bugs that span multiple steps become detectable because oracles track data, temporal, and causal dependencies.
The same agent generalizes across different natural language phrasings and across model scales without retraining.
A reproducible benchmark of bug-injected web applications now exists for systematic comparison of NL-to-E2E testing methods.
The approach directly addresses the hallucination problem that previously made LLM agents untrustworthy as oracles.
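
The difference between checking a state in isolation and checking it against prior steps can be illustrated with a small cart example. This is a sketch under stated assumptions: the `Product` fields follow the title, quantity, and price attributes mentioned in the figure captions, while `isolated_check`, `cross_step_check`, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Product:
    title: str
    quantity: int
    price: float


def isolated_check(cart: List[Product]) -> bool:
    """State-in-isolation oracle: only asks whether the cart looks well formed."""
    return all(p.quantity > 0 and p.price >= 0 for p in cart)


def cross_step_check(prior: List[Product], added: Product,
                     current: List[Product]) -> List[str]:
    """Cross-step oracle: the current cart must equal the prior cart plus the added product."""
    expected = {p.title: p for p in prior}
    if added.title in expected:
        old = expected[added.title]
        expected[added.title] = Product(old.title, old.quantity + added.quantity, old.price)
    else:
        expected[added.title] = added

    current_by_title = {p.title: p for p in current}
    problems = []
    for title, want in expected.items():
        got = current_by_title.get(title)
        if got is None:
            problems.append(f"missing: {title}")
        elif got != want:
            problems.append(f"mismatch: {title}")
    for title in current_by_title:
        if title not in expected:
            problems.append(f"unexpected: {title}")
    return problems


prior_cart = [Product("Mug", 1, 8.0)]
added_product = Product("Lamp", 1, 20.0)
buggy_cart = [Product("Lamp", 1, 20.0)]  # hypothetical bug: the earlier item was dropped

print(isolated_check(buggy_cart))                               # True
print(cross_step_check(prior_cart, added_product, buggy_cart))  # ['missing: Mug']
```

The buggy cart passes the isolated check because every visible row is valid; only the comparison against the prior cart exposes the dropped item.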

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same symbolization-plus-oracle pattern could be adapted to mobile or desktop interfaces where GUI elements are also extractable as structured variables.
Adding an explicit symbolic layer may reduce error propagation in other long-horizon LLM agent tasks that require consistent state tracking.
The benchmark could serve as a test bed for comparing LLM agents against traditional scripted or model-based testing frameworks.
If symbolization accuracy improves with better vision models, overall bug detection rates could rise further without changes to the oracle logic.

Load-bearing premise

The symbolization layer must correctly identify the critical GUI elements, and the inferred pre- and post-conditions must accurately capture the implicit requirements without overlooking context-dependent failures.

What would settle it

A web application and natural language specification where WebTestPilot either reports a bug that is not present in the code or fails to report a real bug that violates the specification because the symbolization or oracle inference missed the relevant dependency.

Figures

Figures reproduced from arXiv: 2602.11724 by Duc-Minh Nguyen, Jin Song Dong, Ruofei Ren, Wenjie Zhang, Xiwen Teoh, Yun Lin.

**Figure 2.** A test flow depicting search, shopping, and checkout on e-commerce platform.

**Figure 3.** Definition of the Product and Cart symbols, represented as Pydantic schemas. WebTestPilot then instantiates these schemas with values extracted from the current and prior states. By referencing page reidentification, it recognizes that State (a) and State (f) correspond to the Cart page and learns a high-level overview of its layout (e.g., the cart contains a list of items, each displaying specific informa…

**Figure 4.** Extracting the added product from product page and comparing current and prior cart details.

**Figure 6.** Example (from motivating scenario): WebTestPilot extracts symbols via declared schemas that correspond to GUI elements for making assertions on the application state. Code fragment extracted alongside this caption (truncated in the source):

    # Verify products (title, quantity, price) consistency
    for prod in prior + [added]:
        match = next((p for p in current if p.title == prod.title), None)
        assert match is not None, f"Product {prod.title} missing in current cart"
        assert match.quan…

**Figure 5.** Assertion generated by WebTestPilot. Text extracted alongside this caption (Section 3, Problem Statement, Preliminary): We model a web application W as a graph of states 𝑠 ∈ S. Each state is defined as a tuple 𝑠 = (screenshot, DOM), where screenshot encodes the visual appearance of the page, and DOM is a rooted, ordered tree of UI elements 𝑒, where each element encodes its type (i.e., button, input), relevant attributes (e.g., name, value, enabled/disable…

**Figure 7.** Visualization of the problem statement’s input and output.

**Figure 8.** Overview of WebTestPilot. WebTestPilot parses a natural language requirement into structured steps (Input Parsing), each specifying a condition, action, and expectation. For each step, it performs Oracle Inference to generate predicate assertions over symbols capturing explicit and implicit requirements. During Oracle Execution, it checks preconditions, executes the action, checks postconditions. Failed as…

**Figure 9.** BNF syntax of DSL for writing test assertions.

**Figure 11.** An example test assertion for a test step.

**Figure 12.** Example of transformed test requirements.

**Figure 13.** Performance of different models (RQ4) under different transformed input requirements (RQ3).
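
The captions for Figures 3 and 5 describe a state as a (screenshot, DOM) pair and symbols as schemas filled in from that state. The sketch below shows one way such a data structure and extraction step could look. It is not the paper's implementation: it uses standard-library dataclasses where the paper uses Pydantic schemas, the `text` field and the CSS class names (`cart-item`, `title`, `qty`, `price`) are hypothetical, and a rule-based walker stands in for whatever model-driven extraction WebTestPilot performs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Element:
    """One node in a rooted, ordered DOM tree."""
    type: str                                   # e.g. "button", "input"
    attributes: Dict[str, str] = field(default_factory=dict)
    text: str = ""                              # assumption: visible text is kept per node
    children: List["Element"] = field(default_factory=list)


@dataclass
class State:
    screenshot: bytes   # visual appearance of the page
    dom: Element        # root of the DOM tree


@dataclass(frozen=True)
class Product:
    title: str
    quantity: int
    price: float


def walk(node: Element) -> Iterator[Element]:
    """Depth-first, document-order traversal."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(node: Element, css_class: str) -> Optional[Element]:
    return next((n for n in walk(node) if n.attributes.get("class") == css_class), None)


def symbolize_cart(state: State) -> List[Product]:
    """Turn every cart row in the DOM into a Product symbol (hypothetical page layout)."""
    products = []
    for node in walk(state.dom):
        if node.attributes.get("class") != "cart-item":
            continue
        title, qty, price = find(node, "title"), find(node, "qty"), find(node, "price")
        if title is None or qty is None or price is None:
            continue  # skip rows that do not match the expected layout
        products.append(
            Product(title.text, int(qty.attributes["value"]), float(price.text.lstrip("$")))
        )
    return products


row = Element("li", {"class": "cart-item"}, children=[
    Element("span", {"class": "title"}, text="Mug"),
    Element("input", {"class": "qty", "value": "2"}),
    Element("span", {"class": "price"}, text="$8.50"),
])
cart_state = State(screenshot=b"", dom=Element("ul", children=[row]))
print(symbolize_cart(cart_state))
```

Once the cart is a list of typed `Product` values, assertions such as the fragment shown with Figure 6 can compare current and prior carts without touching the DOM again.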

Original abstract

Visual language model (VLM) agents show great promise in automating end-to-end (E2E) web testing against requirements in natural language. However, the probabilistic nature of language models can have inherent hallucinations. Therefore, given a detected inconsistency between the requirement and the web application, it is hard to distinguish whether it stems from the hallucination or a real application bug. Addressing this issue presents two core technical challenges: the implicit oracle inference challenge, where the agent must act as its own oracle to implicitly decide if the application's behavior is correct without guidance, and the probabilistic inference challenge, where an LLM's inconsistent reasoning undermines its trustworthiness as an oracle. Existing LLM-based approaches fail to capture such implicit oracles, either by treating any page navigation that doesn't crash as a success, or by checking each state in isolation, thus missing bugs dependent on context from prior steps. We introduce WebTestPilot, an LLM-based agent designed to address these challenges. WebTestPilot uses (1) a symbolization layer which detects and symbolizes critical GUI elements on the web application into symbols (i.e., variables) and (2) translates natural language specification into a sequence of steps, each of which is equipped with inferred pre- and post-conditions over the symbols as an oracle. This oracle captures data, temporal, and causal dependencies, enabling the validation of implicit requirements. To advance research in this area, we build a benchmark of bug-injected web apps for evaluating NL-to-E2E testing. The results show that WebTestPilot achieves a task completion rate of 99%, with 96% precision and 96% recall in bug detection, outperforming the best baseline (+70 precision, +27 recall). The agent generalizes across diverse natural language inputs and model scales.
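
For readers who want the arithmetic behind the headline metrics, the sketch below computes precision and recall from confusion counts. The abstract reports no raw counts, so the counts used here are hypothetical and chosen only to reproduce 96%. Reading "+70 precision, +27 recall" as percentage-point gains is also an assumption; under that reading the best baseline would sit at roughly 26% precision and 69% recall.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 96 real bugs reported, 4 false alarms, 4 real bugs missed.
precision, recall = precision_recall(tp=96, fp=4, fn=4)
print(f"precision={precision:.2f} recall={recall:.2f}")

# Implied baseline, assuming the reported gains are percentage points.
reported_precision, reported_recall = 96, 96
implied_baseline_precision = reported_precision - 70
implied_baseline_recall = reported_recall - 27
print(implied_baseline_precision, implied_baseline_recall)  # 26 69
```

A false alarm here covers both hallucinated inconsistencies and mis-inferred oracles, which is why the referee report below asks for a breakdown by failure source.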

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WebTestPilot adds a GUI symbolization layer plus explicit pre/post-condition oracles to let LLMs serve as their own testers, but the headline 96% precision/recall numbers rest on unshown details about how the symbols are built and how the benchmark bugs were injected.

The main thing here is a practical way to turn natural-language specs into testable oracles for web apps. Instead of letting the LLM just wander and call success on any non-crashing page, the method first turns visible GUI elements into named symbols, then writes pre- and post-conditions over those symbols for each step. That gives the agent something concrete to check against, which should catch context-dependent bugs that current LLM testers miss. The paper also ships a new benchmark of bug-injected web apps, which is useful even if the numbers need checking. Those two pieces, symbolization plus condition inference, are the actual addition over the baselines mentioned in the abstract.

The reported 99% task completion and 96% precision/recall look strong on paper and the generalization claim across models is worth testing. The soft spots are exactly where the stress-test note points: no separate accuracy number for the symbolization step, no ablation that removes it, and no description of how the injected bugs were chosen or verified. If symbolization misfires on dynamic widgets or state changes, the oracles become invalid and the big gains disappear. The abstract also gives no statistical tests or error breakdown, so it is hard to know whether the +70 precision lift is real or tied to the particular benchmark construction.

This is the kind of work that belongs in a reading group once the full methods and artifacts are available. A serious referee should see it, because the core idea is clear and the problem it targets is real, but the current write-up leaves too many moving parts unexamined for anyone to rely on the numbers yet.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces WebTestPilot, an LLM-based agent for end-to-end web testing from natural language specifications. It uses a symbolization layer to map GUI elements to stable symbols and infers pre- and post-conditions over those symbols to act as oracles that capture data, temporal, and causal dependencies. This is intended to distinguish real bugs from model hallucinations. The authors construct a new benchmark of bug-injected web apps and report 99% task completion, 96% precision, and 96% recall in bug detection, outperforming the strongest baseline by +70 precision and +27 recall while generalizing across NL inputs and model scales.

Significance. If the performance claims hold after proper validation, the work would advance automated web testing by providing a concrete mechanism for implicit oracle inference that existing methods lack. The symbolization-plus-oracle approach and the new benchmark are useful contributions that could support follow-on research. The reported generalization across model scales is a positive signal for practical deployment.

major comments (3)

[Evaluation] Evaluation section: benchmark construction details—including bug injection procedure, how ground-truth oracles are established, and selection criteria for the injected bugs—are not provided. These details are load-bearing for the 96% precision/recall claims, as the metrics cannot be interpreted without knowing whether the injected bugs are representative or whether the evaluation inadvertently favors the proposed oracle inference.
[Method] Method section (symbolization and oracle inference): no accuracy metric, error analysis, or ablation is reported for the symbolization layer itself. Because the central claim rests on the assumption that symbolization reliably extracts critical elements and enables valid pre/post-condition inference, the absence of this analysis leaves the source of the +70/+27 gains unclear.
[Results] Results section: no statistical tests, run-to-run variance, or breakdown of failure cases (e.g., symbolization errors vs. oracle mis-inference) are supplied for the headline metrics. This omission prevents assessment of whether the reported superiority over baselines is robust.

minor comments (2)

[Abstract] Abstract: the baseline comparison states absolute gains but does not name the strongest baseline or report its absolute scores, making the improvement harder to contextualize.
[Method] Notation: the mapping from GUI elements to symbols is described at a high level; a small example showing an actual page state, the extracted symbols, and the resulting pre/post-conditions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript requires additional details and analyses to support the claims and will revise accordingly.

Referee: [Evaluation] Evaluation section: benchmark construction details—including bug injection procedure, how ground-truth oracles are established, and selection criteria for the injected bugs—are not provided. These details are load-bearing for the 96% precision/recall claims, as the metrics cannot be interpreted without knowing whether the injected bugs are representative or whether the evaluation inadvertently favors the proposed oracle inference.

Authors: We agree that the benchmark details are insufficient. In the revised manuscript we will add a dedicated subsection in Evaluation describing the bug injection procedure (with concrete examples of data, temporal, and causal bugs), the process for establishing ground-truth oracles via independent expert annotation of each test case, and the selection criteria used to ensure the injected bugs are representative of real-world web faults and not biased toward our oracle mechanism. revision: yes
Referee: [Method] Method section (symbolization and oracle inference): no accuracy metric, error analysis, or ablation is reported for the symbolization layer itself. Because the central claim rests on the assumption that symbolization reliably extracts critical elements and enables valid pre/post-condition inference, the absence of this analysis leaves the source of the +70/+27 gains unclear.

Authors: We acknowledge the absence of direct validation for the symbolization layer. We will augment the Method section with (1) an accuracy metric for symbolization on a held-out set of pages, (2) a qualitative error analysis of common failure modes, and (3) an ablation that removes the symbolization layer to quantify its contribution to the observed gains over baselines. revision: yes
Referee: [Results] Results section: no statistical tests, run-to-run variance, or breakdown of failure cases (e.g., symbolization errors vs. oracle mis-inference) are supplied for the headline metrics. This omission prevents assessment of whether the reported superiority over baselines is robust.

Authors: We will strengthen the Results section by adding statistical significance tests (e.g., McNemar’s test for paired comparisons), reporting run-to-run variance obtained by re-executing the experiments with different random seeds, and providing a breakdown of failure cases categorized by source (symbolization errors, oracle mis-inference, navigation failures, etc.). These additions will allow readers to assess the robustness of the +70 precision and +27 recall improvements. revision: yes
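
The rebuttal names McNemar's test for the paired comparison. As a rough sketch of what that involves, the exact two-sided version needs only the discordant pairs: test cases where one method is correct and the other is not. The function name and the counts below are hypothetical, not results from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value.

    b: cases where method A is correct and method B is wrong.
    c: cases where method A is wrong and method B is correct.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for two testing agents on the same test cases.
print(mcnemar_exact(10, 1))  # 0.01171875
print(mcnemar_exact(5, 5))   # 1.0
```

Cases where both methods agree do not enter the statistic, so the test stays meaningful even when most benchmark cases are easy for both agents.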

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an LLM-based agent with a symbolization layer and inferred pre/post-condition oracles, evaluated empirically on a newly constructed benchmark of bug-injected apps. No equations, fitted parameters, or self-citations are presented that reduce the reported 99% task completion or 96% precision/recall metrics to inputs by construction. The performance claims rest on external evaluation rather than definitional equivalence or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that symbolization plus inferred conditions can reliably separate hallucinations from real bugs; no free parameters are named, but the method introduces a new symbolization layer whose accuracy is not independently evidenced in the abstract.

axioms (1)

domain assumption: Symbolized GUI elements can be used to infer pre- and post-conditions that capture data, temporal, and causal dependencies in web applications.
Stated as the core mechanism for addressing the implicit oracle inference challenge.

invented entities (1)

Symbolization layer (no independent evidence)
purpose: Detects and converts critical GUI elements into symbols for oracle construction
New component introduced to enable context-aware oracle inference


Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 1 internal anchor

[1]

Parsa Alian, Noor Nashid, Mobina Shahbandeh, and Ali Mesbah. 2024. Semantic constraint inference for web form test generation. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 932–944

work page 2024
[2]

Parsa Alian, Noor Nashid, Mobina Shahbandeh, Taha Shabani, and Ali Mesbah. 2025. Feature-Driven End-to-End Test Generation .2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)(2025), 450–462. https://doi.org/10.1109/ICSE55347.2025.00141

work page doi:10.1109/icse55347.2025.00141 2025
[3]

Shay Artzi, Adam Kiezun, Julian Dolby, Frank Tip, Danny Dig, Amit Paradkar, and Michael D Ernst. 2008. Finding bugs in dynamic web applications. InProceedings of the 2008 international symposium on Software testing and analysis. 261–272

work page 2008
[4]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Kesina Baral, John Johnson, Junayed Mahmud, Sabiha Salma, Mattia Fazzini, Julia Rubin, Jeff Offutt, and Kevin Moran

work page
[6]

InProceedings of the 21st International Conference on Mining Software Repositories

Automating gui-based test oracles for mobile apps. InProceedings of the 21st International Conference on Mining Software Repositories. 309–321

work page
[7]

Matteo Biagiola, Filippo Ricca, and Paolo Tonella. 2017. Search based path and input data generation for web application testing. InInternational Symposium on Search Based Software Engineering. Springer, 18–32

work page 2017
[8]

Matteo Biagiola, Andrea Stocco, Filippo Ricca, and Paolo Tonella. 2019. Diversity-based web test generation. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 142–153

work page 2019
[9]

https://github.com/BookStackApp/BookStack

Bookstack 2015. https://github.com/BookStackApp/BookStack

work page 2015
[10]

Xiaoning Chang, Zheheng Liang, Yifei Zhang, Lei Cui, Zhenyue Long, Guoquan Wu, Yu Gao, Wei Chen, Jun Wei, and Tao Huang. 2023. A reinforcement learning approach to generating test cases for web applications. In2023 IEEE/ACM International Conference on Automation of Software Test (AST). IEEE, 13–23

work page 2023
[11]

Antoine Chevrot, Alexandre Vernotte, Jean-Rémy Falleri, Xavier Blanc, Bruno Legeard, and Aymeric Cretin. 2025. Are Autonomous Web Agents Good Testers?Proceedings of the ACM on Software Engineering2, ISSTA (2025), 206–228

work page 2025
[12]

Anna Corazza, Sergio Di Martino, Adriano Peron, and Luigi Libero Lucio Starace. 2021. Web application testing: Using tree kernels to detect near-duplicate states in automated model inference. InProceedings of the 15th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 1–6

work page 2021
[13]

https://cucumber.io/

Cucumber 2014. https://cucumber.io/

work page 2014
[14]

Sergio Di Meglio, Luigi Libero Lucio Starace, Valeria Pontillo, Ruben Opdebeeck, Coen De Roover, and Sergio Di Martino

work page
[15]

In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR)

E2EGit: A Dataset of End-to-End Web Tests in Open Source Projects. In2025 IEEE/ACM 22nd International Conference on Mining Software Repositories (MSR). IEEE, 836–840

work page
[16]

Zhen Dong, Marcel Böhme, Lucia Cojocaru, and Abhik Roychoudhury. 2020. Time-travel testing of android apps. In Proceedings of the ACM/IEEE 42nd international conference on software engineering. 481–492

work page 2020
[17]

Amin Milani Fard and Ali Mesbah. 2013. Feedback-directed exploration of web applications to derive test models.. In ISSRE, Vol. 13. 278–287

work page 2013
[18]

Boni García, Maurizio Leotta, Filippo Ricca, and Jim Whitehead. 2024. Use of chatgpt as an assistant in the end-to-end test script generation for android apps. InProceedings of the 15th ACM International Workshop on Automating Test Case Design, Selection and Evaluation. 5–11. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE087. Publication date: July 2...

work page 2024
[19]

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. 2025. Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=kxnoqaisCT

work page 2025
[20]

https://github.com/marmelab/gremlins.js/

gremlin.js 2014. https://github.com/marmelab/gremlins.js/

work page 2014
[21]

Zhiyu Gu, Chenxu Liu, Guoquan Wu, Yifei Zhang, ChenXi Yang, Zheheng Liang, Wei Chen, and Jun Wei. 2025. Deep Reinforcement Learning for Automated Web GUI Testing.arXiv preprint arXiv:2504.19237(2025)

work page arXiv 2025
[22]

Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. 2025. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833(2025)

work page arXiv 2025
[23]

https://www.qt.io/quality-assurance/squish

https://www.qt.io/quality-assurance/squish 2003. https://www.qt.io/quality-assurance/squish

work page 2003
[24]

Gang Hu, Linjie Zhu, and Junfeng Yang. 2018. AppFlow: using machine learning to synthesize robust, reusable UI tests. InProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 269–282

work page 2018
[25]

Yongxiang Hu, Xuan Wang, Yingchuan Wang, Yu Zhang, Shiyu Guo, Chaoyi Chen, Xin Wang, and Yangfan Zhou

work page
[26]

Auitestagent: Automatic requirements oriented gui function testing.arXiv preprint arXiv:2407.09018(2024)

work page arXiv 2024
[27]

https://github.com/indico/indico

Indico 2004. https://github.com/indico/indico

work page 2004
[28]

https://github.com/invoiceninja/invoiceninja

Invoice Ninja 2018. https://github.com/invoiceninja/invoiceninja

work page 2018
[29]

https://github.com/lavague-ai/LaVague

LaVague 2024. https://github.com/lavague-ai/LaVague

work page 2024
[30]

Maurizio Leotta, Hafiz Zeeshan Yousaf, Filippo Ricca, and Boni Garcia. 2024. Ai-generated test scripts for web e2e testing with chatgpt and copilot: A preliminary study. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 339–344

work page 2024
[31]

Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, and Wynne Hsu. 2025. FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents.arXiv preprint arXiv:2506.01520(2025)

work page arXiv 2025
[32]

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua

work page
[33]

InProceedings of the 33rd ACM International Conference on Multimedia

Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia. 8778–8786

work page
[34]

Chenxu Liu, Zhiyu Gu, Guoquan Wu, Ying Zhang, Jun Wei, and Tao Xie. 2025. Temac: Multi-Agent Collaboration for Automated Web GUI Testing.arXiv preprint arXiv:2506.00520(2025)

work page arXiv 2025
[35]

Chenxu Liu, Junheng Wang, Wei Yang, Ying Zhang, and Tao Xie. 2025. Judge: Effective State Abstraction for Guiding Automated Web GUI Testing.ACM Transactions on Software Engineering and Methodology(2025)

work page 2025
[36]

Ruofan Liu, Xiwen Teoh, Yun Lin, Guanjie Chen, Ruofei Ren, Denys Poshyvanyk, and Jin Song Dong. 2025. GUIPilot: A Consistency-Based Mobile GUI Testing Approach for Detecting Application-Specific Bugs.Proceedings of the ACM on Software Engineering2, ISSTA (2025), 753–776

work page 2025
[37]

Xinyue Liu, Zihe Song, Weike Fang, Wei Yang, and Weihang Wang. 2024. Wefix: Intelligent automatic generation of explicit waits for efficient web end-to-end flaky tests. InProceedings of the ACM Web Conference 2024. 3043–3052

work page 2024
[38]

Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1355–1367

work page 2023
[39]

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Chatting with gpt-3 for zero-shot human-like mobile automated gui testing.arXiv preprint arXiv:2305.09434(2023)

work page arXiv 2023
[40]

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2024. Make llm a testing expert: Bringing human-like interaction to mobile gui testing via functionality-aware decisions. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

work page 2024
[41]

Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Zhilin Tian, Yuekai Huang, Jun Hu, and Qing Wang. 2024. Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model. InProceedings of the IEEE/ACM 46th International conference on software engineering. 1–12

work page 2024
[42]

Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yawen Wang, Jun Hu, and Qing Wang. 2024. Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps.arXiv preprint arXiv:2407.03037(2024)

work page arXiv 2024
[43]

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. 2024. Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203(2024)

work page arXiv 2024
[44]

Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. 2025. Uxagent: An llm agent-based usability testing framework for web design. InProceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–12

work page 2025
[45]

Ke Mao, Mark Harman, and Yue Jia. 2016. Sapienz: Multi-objective automated testing for android applications. In Proceedings of the 25th international symposium on software testing and analysis. 94–105. Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE087. Publication date: July 2026. WebTestPilot: Agentic End-to-End Web Testing against Natural Language ...

work page 2016
[46]

Leonardo Mariani, Mauro Pezzè, Oliviero Riganelli, and Mauro Santoro. 2011. AutoBlackTest: a tool for automatic black-box testing. InProceedings of the 33rd international conference on software engineering. 1013–1015

work page 2011
[47]

Ali Mesbah, Engin Bozdag, and Arie Van Deursen. 2008. Crawling Ajax by inferring user interface state changes. In 2008 eighth international conference on web engineering. IEEE, 122–134

work page 2008
[48]

Ali Mesbah, Arie Van Deursen, and Danny Roest. 2011. Invariant-based automatic testing of modern web applications. IEEE Transactions on Software Engineering38, 1 (2011), 35–53

work page 2011
[49]

https://developer.android.com/studio/test/other-testing-tools/monkey

Monkey 2023. https://developer.android.com/studio/test/other-testing-tools/monkey

work page 2023
[50]

Dario Olianas, Maurizio Leotta, Filippo Ricca, Matteo Biagiola, and Paolo Tonella. 2021. STILE: a tool for parallel execution of E2E web test scripts. In2021 14th IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 460–465

work page 2021
[51]

Yu Pei, Jeongju Sohn, Sarra Habchi, and Mike Papadakis. 2025. Non-flaky and nearly optimal time-based treatment of asynchronous wait web tests.ACM Transactions on Software Engineering and Methodology34, 2 (2025), 1–29

work page 2025
[52]

Sven Peldszus, Noubar Akopian, and Thorsten Berger. 2023. RobotBT: Behavior-tree-based test-case specification for the robot framework. InProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis. 1503–1506

work page 2023
[53]

Chao Peng, Zhengwei Lv, Jiarong Fu, Jiayuan Liang, Zhao Zhang, Ajitha Rajan, and Ping Yang. 2024. Hawkeye: Change-targeted testing for android apps based on deep reinforcement learning. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice. 298–308

work page 2024
[54]

https://github.com/saleor/saleor

Prestashop 2007. https://github.com/saleor/saleor

work page 2007
[55]

https://www.grandviewresearch

Progressive Web Apps Market Size, Share & Trends Analysis Report, 2024–2030 2024. https://www.grandviewresearch. com/industry-analysis/progressive-web-apps-pwa-market-report

work page 2024
[56]

Dezhi Ran, Hao Wang, Zihe Song, Mengzhou Wu, Yuan Cao, Ying Zhang, Wei Yang, and Tao Xie. 2024. Guardian: A runtime framework for LLM-based UI exploration. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. 958–970

work page 2024
[57]

https://rspec.info/

RSpec 2007. https://rspec.info/

[58]

Sabiha Salma, SM Hasan Mansur, Yule Zhang, and Kevin Moran. 2024. GuiEvo: Automated Evolution of Mobile App UIs. In Proceedings of the 21st International Conference on Mining Software Repositories. 335–347

[59]

Mobina Shahbandeh, Parsa Alian, Noor Nashid, and Ali Mesbah. 2024. Naviqate: Functionality-guided web application navigation. arXiv preprint arXiv:2409.10741 (2024)

[60]

Fei Shao, Rui Xu, Wasif Haque, Jingwei Xu, Ying Zhang, Wei Yang, Yanfang Ye, and Xusheng Xiao. 2021. Webevo: taming web application evolution via detecting semantic structure changes. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 16–28

[61]

Salman Sherin, Asmar Muqeet, Muhammad Uzair Khan, and Muhammad Zohaib Iqbal. 2023. QExplore: An exploration strategy for dynamic web applications using guided search. Journal of Systems and Software 195 (2023), 111512

[62]

State of Software Quality Report 2024. https://katalon.com/reports/state-quality-2024

[63]

Andrea Stocco, Alexandra Willi, Luigi Libero Lucio Starace, Matteo Biagiola, and Paolo Tonella. 2023. Neural embeddings for web testing. arXiv preprint arXiv:2306.07400 (2023)

[64]

Ting Su, Lingling Fan, Sen Chen, Yang Liu, Lihua Xu, Geguang Pu, and Zhendong Su. 2020. Why my app crashes? understanding and benchmarking framework-specific exceptions of android apps. IEEE Transactions on Software Engineering 48, 4 (2020), 1115–1137

[65]

Ting Su, Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and Zhendong Su. 2017. Guided, stochastic model-based GUI testing of Android apps. In Proceedings of the 2017 11th joint meeting on foundations of software engineering. 245–256

[67]

Ting Su, Jue Wang, and Zhendong Su. 2021. Benchmarking automated gui testing for android against real-world bugs. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 119–130

[68]

Maryam Taeb, Amanda Swearngin, Eldon Schoop, Ruijia Cheng, Yue Jiang, and Jeffrey Nichols. 2024. Axnav: Replaying accessibility tests from natural language. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–16

[69]

The Failed Launch Of www.HealthCare.gov 2016. https://d3.harvard.edu/platform-rctom/submission/the-failed-launch-of-www-healthcare-gov/

[70]

The Payroll System That Cost Queensland Health AU$1.25 Billion [n. d.]. https://www.henricodolfing.com/2019/12/project-failure-case-study-queensland-health.html

[71]

Siyi Wang, Sinan Wang, Yujia Fan, Xiaolei Li, and Yepang Liu. 2024. Leveraging large vision-language model for better automatic web GUI testing. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 125–137

[72]

Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. 2025. MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents. arXiv preprint arXiv:2507.19478 (2025)

[73]

Zhitao Wang, Wei Wang, Zirao Li, Long Wang, Can Yi, Xinjie Xu, Luyang Cao, Hanjing Su, Shouzhi Chen, and Jun Zhou. 2024. Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model. arXiv preprint arXiv:2401.02705 (2024)

[74]

Thomas D White, Gordon Fraser, and Guy J Brown. 2019. Improving random GUI testing with image-based widget detection. In Proceedings of the 28th ACM SIGSOFT international symposium on software testing and analysis. 307–317

[75]

Yiheng Xiong, Ting Su, Jue Wang, Jingling Sun, Geguang Pu, and Zhendong Su. 2024. General and practical property-based testing for android apps. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 53–64

[76]

Rahulkrishna Yandrapally, Andrea Stocco, and Ali Mesbah. 2020. Near-duplicate detection in web app model inference. In Proceedings of the ACM/IEEE 42nd international conference on software engineering. 186–197

[77]

Rahul Krishna Yandrapally and Ali Mesbah. 2022. Fragment-based test generation for web apps. IEEE Transactions on Software Engineering 49, 3 (2022), 1086–1101

[78]

Juyeon Yoon, Robert Feldt, and Shin Yoo. 2024. Intent-driven mobile gui testing with autonomous large language model agents. In 2024 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 129–139

[79]

Shengcheng Yu, Chunrong Fang, Mingzhe Du, Yuchen Ling, Zhenyu Chen, and Zhendong Su. 2024. Practical non-intrusive GUI exploration testing with visual-based robotic arms. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13

[80]

Shengcheng Yu, Chunrong Fang, Xin Li, Yuchen Ling, Zhenyu Chen, and Zhendong Su. 2024. Effective, platform-independent gui testing via image embedding and reinforcement learning. ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–27

