
arxiv: 2602.11724 · v3 · submitted 2026-02-12 · 💻 cs.SE

WebTestPilot: Agentic End-to-End Web Testing against Natural Language Specification by Inferring Oracles with Symbolized GUI Elements

Pith reviewed 2026-05-16 05:59 UTC · model grok-4.3

classification 💻 cs.SE
keywords web testing · LLM agents · natural language specifications · oracle inference · GUI symbolization · end-to-end testing · bug detection · hallucination mitigation

The pith

WebTestPilot uses a symbolization layer on GUI elements to infer pre- and post-condition oracles that let LLM agents test web apps against natural language specifications while separating hallucinations from real bugs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WebTestPilot as an LLM-based agent for end-to-end web testing from natural language requirements. It adds a symbolization layer that turns critical GUI elements into variables and then derives pre- and post-conditions for each test step to serve as an implicit oracle. These conditions capture data, temporal, and causal dependencies across steps so the agent can validate requirements that would otherwise be missed by simple navigation checks or isolated state verification. The method is evaluated on a new benchmark of web applications with injected bugs, where it reaches 99 percent task completion and 96 percent precision and recall in bug detection. The results hold across varied natural language inputs and different model sizes.

Core claim

WebTestPilot is an LLM-based agent for end-to-end web testing that first detects and symbolizes critical GUI elements into variables and then translates a natural language specification into a sequence of steps, each equipped with inferred pre- and post-conditions over those symbols. These oracles capture dependencies that allow the agent to act as its own validator and distinguish inconsistencies caused by model hallucinations from genuine application bugs. Existing approaches either accept any crash-free navigation or examine states in isolation and therefore miss context-dependent failures.

What carries the argument

Symbolization layer that converts critical GUI elements into variables, paired with inference of pre- and post-conditions over those variables to form per-step oracles.
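
To make the mechanism concrete, here is a minimal Python sketch of a post-condition over symbols for an "add to cart" step. It is illustrative only. The paper's figure captions describe Product and Cart symbols as Pydantic schemas; this sketch substitutes standard-library dataclasses. The function name `post_add_to_cart`, the field layout, and the sample values are assumptions, not the authors' code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Product:
    """Hypothetical symbol for one product row shown in the GUI."""
    title: str
    quantity: int
    price: float


@dataclass
class Cart:
    """Hypothetical symbol for the cart page: a list of product rows."""
    items: list[Product] = field(default_factory=list)


def post_add_to_cart(prior: Cart, added: Product, current: Cart) -> list[str]:
    """Post-condition for an 'add to cart' step.

    Every product in the prior cart, plus the newly added one, must appear
    unchanged in the current cart. Returns violation messages; an empty
    list means the oracle passed.
    """
    violations = []
    for expected in prior.items + [added]:
        match = next((p for p in current.items if p.title == expected.title), None)
        if match is None:
            violations.append(f"{expected.title} missing from current cart")
        elif (match.quantity, match.price) != (expected.quantity, expected.price):
            violations.append(f"{expected.title} changed unexpectedly")
    return violations


# Hypothetical values: the second cart silently drops the earlier item.
prior = Cart([Product("Mug", 1, 8.0)])
added = Product("Pen", 2, 1.5)
good = Cart([Product("Mug", 1, 8.0), Product("Pen", 2, 1.5)])
buggy = Cart([Product("Pen", 2, 1.5)])

print(post_add_to_cart(prior, added, good))
print(post_add_to_cart(prior, added, buggy))
```

The check compares the current state against values captured in earlier steps, which is the kind of cross-step dependency that an isolated state check cannot see. The sketch simplifies one case: it does not handle adding a product that is already in the cart.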

If this is right

  • LLM agents can now perform reliable end-to-end testing against natural language specifications without needing manually written oracles.
  • Context-dependent bugs that span multiple steps become detectable because oracles track data, temporal, and causal dependencies.
  • The same agent generalizes across different natural language phrasings and across model scales without retraining.
  • A reproducible benchmark of bug-injected web applications now exists for systematic comparison of NL-to-E2E testing methods.
  • The approach directly addresses the hallucination problem that previously made LLM agents untrustworthy as oracles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same symbolization-plus-oracle pattern could be adapted to mobile or desktop interfaces where GUI elements are also extractable as structured variables.
  • Adding an explicit symbolic layer may reduce error propagation in other long-horizon LLM agent tasks that require consistent state tracking.
  • The benchmark could serve as a test bed for comparing LLM agents against traditional scripted or model-based testing frameworks.
  • If symbolization accuracy improves with better vision models, overall bug detection rates could rise further without changes to the oracle logic.

Load-bearing premise

The symbolization layer must correctly identify the critical GUI elements, and the inferred pre- and post-conditions must accurately capture the implicit requirements without overlooking context-dependent failures.

What would settle it

A web application and natural language specification where WebTestPilot either reports a bug that is not present in the code or fails to report a real bug that violates the specification because the symbolization or oracle inference missed the relevant dependency.
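
The two failure modes above map onto the reported metrics: a bug reported where none exists is a false positive and lowers precision, while a missed real bug is a false negative and lowers recall. A short Python sketch of the standard definitions follows. The counts are hypothetical, chosen only so the arithmetic is easy to follow, and are not taken from the paper's benchmark.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard precision and recall from confusion counts.

    tp: real bugs that were reported
    fp: reports with no underlying bug (e.g. caused by a hallucination)
    fn: real bugs that were not reported
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 48 bugs correctly reported, 2 spurious reports, 2 misses.
precision, recall = precision_recall(tp=48, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A single counterexample of either kind would not by itself overturn the aggregate numbers, but it would show which component (symbolization or oracle inference) failed.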

Figures

Figures reproduced from arXiv: 2602.11724 by Duc-Minh Nguyen, Jin Song Dong, Ruofei Ren, Wenjie Zhang, Xiwen Teoh, Yun Lin.

Figure 1: Inconsistent reasoning by different LLMs with multiple trials in test state verification.
Figure 2: A test flow depicting search, shopping, and checkout on e-commerce platform.
Figure 3: Definition of the Product and Cart symbols, represented as Pydantic schemas. WebTestPilot then instantiates these schemas with values extracted from the current and prior states. By referencing page reidentification, it recognizes that State (a) and State (f) correspond to the Cart page and learns a high-level overview of its layout (e.g., the cart contains a list of items, each displaying specific informa…
Figure 4: Extracting the added product from product page and comparing current and prior cart details.
Figure 6: Example (from motivating scenario): WebTestPilot extracts symbols via declared schemas that correspond to GUI elements for making assertions on the application state. The extracted caption is followed by an assertion listing, truncated in the source:

    # Verify products (title, quantity, price) consistency
    for prod in prior + [added]:
        match = next((p for p in current if p.title == prod.title), None)
        assert match is not None, f"Product {prod.title} missing in current cart"
        assert match.quan…

Figure 5: Assertion generated by WebTestPilot. The extracted caption runs on into the paper's Section 3 (Problem Statement), Preliminary: We model a web application W as a graph of states 𝑠 ∈ S. Each state is defined as a tuple 𝑠 = (screenshot, DOM), where screenshot encodes the visual appearance of the page, and DOM is a rooted, ordered tree of UI elements 𝑒, where each element encodes its type (i.e., button, input), relevant attributes (e.g., name, value, enabled/disable…
Figure 7: Visualization of the problem statement’s input and output.
Figure 8: Overview of WebTestPilot. WebTestPilot parses a natural language requirement into structured steps (Input Parsing), each specifying a condition, action, and expectation. For each step, it performs Oracle Inference to generate predicate assertions over symbols capturing explicit and implicit requirements. During Oracle Execution, it checks preconditions, executes the action, checks postconditions. Failed as…
Figure 9: BNF syntax of DSL for writing test assertions.
Figure 11: An example test assertion for a test step.
Figure 12: Example of transformed test requirements.
Figure 13: Performance of different models (RQ4) under different transformed input requirements (RQ3).
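
The Figure 8 caption describes each step as a condition, an action, and an expectation, executed in the order precondition, action, postcondition. A minimal Python sketch of that loop is given below. Everything in it is an assumption made for illustration: the `Step` and `run_steps` names, the dictionary used as a stand-in for application state, and the verdict strings. The caption is truncated in the source, so the sketch does not model what the system does after a failed assertion beyond returning a verdict.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # hypothetical stand-in: symbol name -> extracted value


@dataclass
class Step:
    name: str
    precondition: Callable[[State], bool]
    action: Callable[[State], State]
    postcondition: Callable[[State, State], bool]  # receives (before, after)


def run_steps(state: State, steps: list[Step]) -> tuple[str, str]:
    """Run steps in order and return (verdict, name of the failing step).

    The postcondition sees both the state before and after the action,
    so it can express dependencies on earlier values.
    """
    for step in steps:
        if not step.precondition(state):
            return "precondition_failed", step.name
        after = step.action(state)
        if not step.postcondition(state, after):
            return "postcondition_failed", step.name
        state = after
    return "pass", ""


def add_item(state: State) -> State:
    """Toy action with the intended behavior: cart count increases by one."""
    return {**state, "cart_count": state["cart_count"] + 1}


def buggy_add_item(state: State) -> State:
    """Toy action with an injected bug: the cart count never changes."""
    return dict(state)


def make_step(action: Callable[[State], State]) -> Step:
    return Step(
        name="add to cart",
        precondition=lambda s: s["logged_in"],
        action=action,
        postcondition=lambda before, after: after["cart_count"] == before["cart_count"] + 1,
    )


start = {"logged_in": True, "cart_count": 0}
print(run_steps(start, [make_step(add_item)]))
print(run_steps(start, [make_step(buggy_add_item)]))
```

Keeping precondition and postcondition failures as separate verdicts is a design choice of this sketch, made so that a step that could not be attempted is distinguishable from one that ran and produced the wrong result.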
Original abstract

Visual language model (VLM) agents show great promise in automating end-to-end (E2E) web testing against requirements in natural language. However, the probabilistic nature of language models can have inherent hallucinations. Therefore, given a detected inconsistency between the requirement and the web application, it is hard to distinguish whether it stems from the hallucination or a real application bug. Addressing this issue presents two core technical challenges: the implicit oracle inference challenge, where the agent must act as its own oracle to implicitly decide if the application's behavior is correct without guidance, and the probabilistic inference challenge, where an LLM's inconsistent reasoning undermines its trustworthiness as an oracle. Existing LLM-based approaches fail to capture such implicit oracles, either by treating any page navigation that doesn't crash as a success, or by checking each state in isolation, thus missing bugs dependent on context from prior steps. We introduce WebTestPilot, an LLM-based agent designed to address these challenges. WebTestPilot uses (1) a symbolization layer which detects and symbolizes critical GUI elements on the web application into symbols (i.e., variables) and (2) translates natural language specification into a sequence of steps, each of which is equipped with inferred pre- and post-conditions over the symbols as an oracle. This oracle captures data, temporal, and causal dependencies, enabling the validation of implicit requirements. To advance research in this area, we build a benchmark of bug-injected web apps for evaluating NL-to-E2E testing. The results show that WebTestPilot achieves a task completion rate of 99%, with 96% precision and 96% recall in bug detection, outperforming the best baseline (+70 precision, +27 recall). The agent generalizes across diverse natural language inputs and model scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces WebTestPilot, an LLM-based agent for end-to-end web testing from natural language specifications. It uses a symbolization layer to map GUI elements to stable symbols and infers pre- and post-conditions over those symbols to act as oracles that capture data, temporal, and causal dependencies. This is intended to distinguish real bugs from model hallucinations. The authors construct a new benchmark of bug-injected web apps and report 99% task completion, 96% precision, and 96% recall in bug detection, outperforming the strongest baseline by +70 precision and +27 recall while generalizing across NL inputs and model scales.

Significance. If the performance claims hold after proper validation, the work would advance automated web testing by providing a concrete mechanism for implicit oracle inference that existing methods lack. The symbolization-plus-oracle approach and the new benchmark are useful contributions that could support follow-on research. The reported generalization across model scales is a positive signal for practical deployment.

major comments (3)
  1. [Evaluation] Evaluation section: benchmark construction details—including bug injection procedure, how ground-truth oracles are established, and selection criteria for the injected bugs—are not provided. These details are load-bearing for the 96% precision/recall claims, as the metrics cannot be interpreted without knowing whether the injected bugs are representative or whether the evaluation inadvertently favors the proposed oracle inference.
  2. [Method] Method section (symbolization and oracle inference): no accuracy metric, error analysis, or ablation is reported for the symbolization layer itself. Because the central claim rests on the assumption that symbolization reliably extracts critical elements and enables valid pre/post-condition inference, the absence of this analysis leaves the source of the +70/+27 gains unclear.
  3. [Results] Results section: no statistical tests, run-to-run variance, or breakdown of failure cases (e.g., symbolization errors vs. oracle mis-inference) are supplied for the headline metrics. This omission prevents assessment of whether the reported superiority over baselines is robust.
minor comments (2)
  1. [Abstract] Abstract: the baseline comparison states absolute gains but does not name the strongest baseline or report its absolute scores, making the improvement harder to contextualize.
  2. [Method] Notation: the mapping from GUI elements to symbols is described at a high level; a small example showing an actual page state, the extracted symbols, and the resulting pre/post-conditions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript requires additional details and analyses to support the claims and will revise accordingly.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: benchmark construction details—including bug injection procedure, how ground-truth oracles are established, and selection criteria for the injected bugs—are not provided. These details are load-bearing for the 96% precision/recall claims, as the metrics cannot be interpreted without knowing whether the injected bugs are representative or whether the evaluation inadvertently favors the proposed oracle inference.

    Authors: We agree that the benchmark details are insufficient. In the revised manuscript we will add a dedicated subsection in Evaluation describing the bug injection procedure (with concrete examples of data, temporal, and causal bugs), the process for establishing ground-truth oracles via independent expert annotation of each test case, and the selection criteria used to ensure the injected bugs are representative of real-world web faults and not biased toward our oracle mechanism. revision: yes

  2. Referee: [Method] Method section (symbolization and oracle inference): no accuracy metric, error analysis, or ablation is reported for the symbolization layer itself. Because the central claim rests on the assumption that symbolization reliably extracts critical elements and enables valid pre/post-condition inference, the absence of this analysis leaves the source of the +70/+27 gains unclear.

    Authors: We acknowledge the absence of direct validation for the symbolization layer. We will augment the Method section with (1) an accuracy metric for symbolization on a held-out set of pages, (2) a qualitative error analysis of common failure modes, and (3) an ablation that removes the symbolization layer to quantify its contribution to the observed gains over baselines. revision: yes

  3. Referee: [Results] Results section: no statistical tests, run-to-run variance, or breakdown of failure cases (e.g., symbolization errors vs. oracle mis-inference) are supplied for the headline metrics. This omission prevents assessment of whether the reported superiority over baselines is robust.

    Authors: We will strengthen the Results section by adding statistical significance tests (e.g., McNemar’s test for paired comparisons), reporting run-to-run variance obtained by re-executing the experiments with different random seeds, and providing a breakdown of failure cases categorized by source (symbolization errors, oracle mis-inference, navigation failures, etc.). These additions will allow readers to assess the robustness of the +70 precision and +27 recall improvements. revision: yes
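
For readers unfamiliar with the test named in the third response, the sketch below computes a two-sided exact McNemar p-value in Python using only the standard library. It is a generic illustration of the statistic, not the authors' analysis script. The function name and the counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes.

    b: test cases that only method A classified correctly
    c: test cases that only method B classified correctly
    Cases where both methods agree do not enter the statistic.
    Under the null hypothesis, min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts: 10 cases favor one method, 0 the other.
print(mcnemar_exact(10, 0))
```

Because the test uses only the discordant pairs, it fits a comparison where two methods are run on the same set of test cases.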

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an LLM-based agent with a symbolization layer and inferred pre/post-condition oracles, evaluated empirically on a newly constructed benchmark of bug-injected apps. No equations, fitted parameters, or self-citations are presented that reduce the reported 99% task completion or 96% precision/recall metrics to inputs by construction. The performance claims rest on external evaluation rather than definitional equivalence or load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that symbolization plus inferred conditions can reliably separate hallucinations from real bugs; no free parameters are named, but the method introduces a new symbolization layer whose accuracy is not independently evidenced in the abstract.

axioms (1)
  • [domain assumption] Symbolized GUI elements can be used to infer pre- and post-conditions that capture data, temporal, and causal dependencies in web applications
    Stated as the core mechanism for addressing the implicit oracle inference challenge.
invented entities (1)
  • Symbolization layer [no independent evidence]
    purpose: Detects and converts critical GUI elements into symbols for oracle construction
    New component introduced to enable context-aware oracle inference

pith-pipeline@v0.9.0 · 5651 in / 1450 out tokens · 42642 ms · 2026-05-16T05:59:54.039055+00:00 · methodology

