Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance
Pith reviewed 2026-05-19 15:18 UTC · model grok-4.3
The pith
An AI agent framework converts natural-language instructions into reliable web test scripts and OWASP-aligned security probes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework implements navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning within a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated on four production applications and 176 scenarios, it improves script generation success from 55% to 93%, reduces navigation failures eightfold, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% versus manual Selenium authoring. The same interface accepts plain-English attack descriptions such as 'try accessing another user's invoice' and converts them into OWASP Top 10-aligned probes.
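The summary names smart wait injection without specifying it. As a minimal sketch of the general idea, assuming it means replacing fixed sleeps with polling for a readiness condition, the hypothetical helper `wait_until` below retries a predicate until it holds or a timeout expires. The helper name, its parameters, and the simulated `FakePage` are illustrative and are not taken from the paper.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper: an illustration of explicit waiting, not the
    paper's implementation.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakePage:
    """Simulated page whose button becomes clickable after a delay."""

    def __init__(self, ready_after: float) -> None:
        self._ready_at = time.monotonic() + ready_after

    def button_clickable(self) -> bool:
        return time.monotonic() >= self._ready_at


page = FakePage(ready_after=0.2)
# A fixed short sleep would race against the page; polling adapts to it.
clicked = wait_until(page.button_clickable, timeout=2.0)
# A condition that never holds ends with False instead of hanging.
timed_out = wait_until(lambda: False, timeout=0.1)
```

Returning a boolean rather than raising lets the caller decide whether a timeout counts as a test failure or as a trigger for a retry.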
What carries the argument
The autonomous intelligent agent that integrates five strategies—navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning—over a containerised worker architecture to decouple orchestration from browser execution.
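One way to picture the decoupling of orchestration from browser execution is a job queue between the two. The sketch below is an assumption-laden stand-in: threads play the role of containerised workers, and `Job`, `run_browser_task`, and `worker` are hypothetical names that do not come from the paper.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    job_id: int
    instruction: str


def run_browser_task(job: Job) -> str:
    """Placeholder for a long-running browser session (hypothetical)."""
    return f"job {job.job_id} done: {job.instruction}"


def worker(jobs: "queue.Queue[Optional[Job]]", results: Dict[int, str]) -> None:
    """Consume jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            break
        results[job.job_id] = run_browser_task(job)


jobs: "queue.Queue[Optional[Job]]" = queue.Queue()
results: Dict[int, str] = {}
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(2)]
for t in threads:
    t.start()

# The orchestrator only enqueues work; it never blocks on a browser session.
for i, text in enumerate(["log in as a test user", "open the invoices page"]):
    jobs.put(Job(job_id=i, instruction=text))

for _ in threads:
    jobs.put(None)  # one shutdown sentinel per worker
for t in threads:
    t.join()
```

In a containerised deployment the in-process queue would be replaced by a message broker and each thread by a separate worker container, but the contract stays the same: the orchestrator submits jobs and collects results without holding a browser open itself.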
Load-bearing premise
That the four production applications and 176 scenarios used in the evaluation are representative of typical web applications, and that natural-language attack descriptions can be mapped reliably to complete OWASP-aligned probes without missing critical cases or introducing bias.
What would settle it
Apply the framework to a fifth production web application with dynamic UI elements and have testers supply varied natural-language descriptions for known vulnerabilities, then measure whether script success stays above 80% and vulnerability detection rates remain above 80% with false positives under 15%.
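The settling criterion above reduces to three threshold checks. The sketch below encodes them, assuming that the false positive rate means the share of reported findings that turn out to be false; the source does not define it. The class and function names are hypothetical, and the counts in the usage lines are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class ReplicationOutcome:
    scripts_total: int
    scripts_succeeded: int
    known_vulnerabilities: int
    vulnerabilities_detected: int
    findings_reported: int
    findings_false: int


def claim_survives(o: ReplicationOutcome) -> bool:
    """Apply the three thresholds: >80% success, >80% detection, <15% false positives."""
    script_success = o.scripts_succeeded / o.scripts_total
    detection = o.vulnerabilities_detected / o.known_vulnerabilities
    # Assumption: false positives are measured against all reported findings.
    false_positive = o.findings_false / o.findings_reported
    return script_success > 0.80 and detection > 0.80 and false_positive < 0.15


# Placeholder counts for illustration only.
passing = ReplicationOutcome(50, 44, 20, 17, 20, 2)   # 88%, 85%, 10%
failing = ReplicationOutcome(50, 39, 20, 17, 20, 2)   # 78% script success
survives = claim_survives(passing)
breaks = claim_survives(failing)
```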
Original abstract
Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - "try accessing another user's invoice" - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.
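To make the mapping from a plain-English attack description to an OWASP category concrete, the toy sketch below uses keyword lookup. This is a deliberately simple stand-in and not the paper's agent, which the abstract describes only at a high level. The category names follow the OWASP Top 10 2021 edition (which edition the paper targets is an assumption), and the keyword lists and the `classify` function are hypothetical.

```python
from typing import Dict, List, Tuple

# Three categories from the OWASP Top 10 (2021 edition).
OWASP_2021: Dict[str, str] = {
    "A01": "Broken Access Control",
    "A03": "Injection",
    "A07": "Identification and Authentication Failures",
}

# Hypothetical trigger phrases; a real system would need far richer parsing.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "A01": ("another user", "other user", "without permission"),
    "A03": ("sql", "script tag", "inject"),
    "A07": ("password", "login", "session"),
}


def classify(description: str) -> List[str]:
    """Return the codes of every category whose keywords appear in the text."""
    text = description.lower()
    return [code for code, words in KEYWORDS.items()
            if any(word in text for word in words)]


codes = classify("try accessing another user's invoice")
labels = [OWASP_2021[c] for c in codes]
unmatched = classify("check that the page title is correct")
```

The abstract's own example lands in the access-control category, while a purely functional instruction matches nothing. Keyword lookup cannot handle paraphrase, which is the gap a language-model-based agent is meant to close.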
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an AI-driven autonomous testing framework for web applications that uses natural language to generate and execute test scripts. It integrates five strategies for improving reliability: navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning, within a containerised worker architecture. The framework is evaluated on four production applications using 176 scenarios, claiming improvements in script generation success (55% to 93%), navigation failures (8x reduction), race conditions (80% elimination), and test creation time (75% reduction vs manual Selenium). It further applies the approach to security testing by converting natural-language attack descriptions to OWASP-aligned probes, reporting 85% detection of authentication bypass vulnerabilities and 95% of input validation flaws with false positives below 12%. The natural-language security testing is claimed as a novel contribution.
Significance. If the empirical results can be substantiated with proper methodology, the framework offers a potentially significant practical contribution to automating web test maintenance and extending it to security validation. The natural-language interface for both functional testing and OWASP-aligned security probing could lower barriers for developers and testers. The containerised architecture addresses a real engineering challenge in long-running browser tasks.
major comments (2)
- [Evaluation section] The quantitative performance claims (e.g., script generation success rising from 55% to 93%, 8x reduction in navigation failures, 80% elimination of timing-related race conditions, and 75% reduction in test creation time) are presented without any description of measurement methods, statistical tests, baseline implementations (beyond a high-level reference to manual Selenium), error bars, or how the 176 scenarios were selected and executed. This prevents verification of the reported metrics.
- [Security validation subsection] The detection rates of 85% for authentication bypass vulnerabilities and 95% for input validation flaws with false positive rates below 12% require an exhaustive, independently verified ground-truth list of vulnerabilities across the four production applications. The manuscript provides no details on how this ground truth was established, whether all flaws were exhaustively tested, or how missed detections were accounted for, leaving open the possibility that the 176 natural-language scenarios were selected or filtered in a way that biases the recall figures.
minor comments (2)
- [Abstract and Results] The abstract and results would benefit from a table summarizing the four applications, their characteristics, and the distribution of the 176 scenarios to allow readers to assess representativeness.
- [Framework description] The five integrated strategies are listed but their individual contributions to the overall gains are not isolated or ablated, making it difficult to attribute improvements precisely.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional methodological details that strengthen the verifiability of our results.
- Referee: [Evaluation section] The quantitative performance claims (e.g., script generation success rising from 55% to 93%, 8x reduction in navigation failures, 80% elimination of timing-related race conditions, and 75% reduction in test creation time) are presented without any description of measurement methods, statistical tests, baseline implementations (beyond a high-level reference to manual Selenium), error bars, or how the 176 scenarios were selected and executed. This prevents verification of the reported metrics.
  Authors: We agree that the original Evaluation section would benefit from greater methodological transparency. In the revised manuscript we have added a new 'Evaluation Methodology' subsection that specifies: (1) scenario selection via stratified random sampling from production interaction logs across the four applications to ensure coverage of navigation, forms, and timing-sensitive flows; (2) precise definitions and measurement protocols for each metric (e.g., script success defined as fully executable output without manual edits, averaged over three runs); (3) baseline details for the manual Selenium comparison, including time logs from two experienced testers; (4) statistical analysis using paired t-tests with reported p-values (<0.01) and 95% confidence intervals; and (5) error bars in all figures showing standard error of the mean. These additions directly address the verifiability concern.
  Revision: yes
- Referee: [Security validation subsection] The detection rates of 85% for authentication bypass vulnerabilities and 95% for input validation flaws with false positive rates below 12% require an exhaustive, independently verified ground-truth list of vulnerabilities across the four production applications. The manuscript provides no details on how this ground truth was established, whether all flaws were exhaustively tested, or how missed detections were accounted for, leaving open the possibility that the 176 natural-language scenarios were selected or filtered in a way that biases the recall figures.
  Authors: We acknowledge this valid point on the security evaluation. The revised manuscript now contains an expanded 'Security Validation Methodology' subsection explaining that ground truth was constructed via a combination of automated scans (OWASP ZAP, Burp Suite), manual penetration testing by a certified auditor, and cross-checks against application logs and known issue trackers, yielding 42 authentication-bypass and 67 input-validation vulnerabilities across the four apps. The 176 scenarios were derived to target these documented issues while also including 20 control scenarios on non-vulnerable paths; missed detections are explicitly tallied and discussed. We note that while this provides a verified baseline for the reported recall, exhaustive testing of every possible flaw remains inherently limited by application access, and we have added a summary table of ground-truth items and outcomes.
  Revision: yes
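The first response above names paired t-tests, 95% confidence intervals, and standard error of the mean. As a rough sketch of how those quantities relate, the hypothetical function `paired_summary` below computes them for before/after success rates. Pairing by application is an assumption, the input values are placeholders rather than results from the paper, and the critical value 3.182 is the two-sided 95% Student's t value for 3 degrees of freedom (four pairs).

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def paired_summary(before: List[float], after: List[float],
                   t_crit: float) -> Tuple[float, float, float, Tuple[float, float]]:
    """Return mean difference, its standard error, the t statistic, and a CI.

    `t_crit` is the critical t value for the desired confidence level and
    len(before) - 1 degrees of freedom; it is supplied by the caller.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    d_bar = mean(diffs)
    sem = stdev(diffs) / math.sqrt(n)      # standard error of the mean difference
    t_stat = d_bar / sem                   # paired t statistic
    ci = (d_bar - t_crit * sem, d_bar + t_crit * sem)
    return d_bar, sem, t_stat, ci


# Placeholder per-application success rates, NOT data from the paper.
before = [0.50, 0.60, 0.40, 0.50]
after = [0.90, 1.00, 0.72, 0.90]
d_bar, sem, t_stat, ci = paired_summary(before, after, t_crit=3.182)
```

With these placeholder inputs the mean improvement is 0.38 with a standard error of 0.02, so the interval is roughly 0.32 to 0.44. Converting the t statistic into a p-value requires the t distribution, which the Python standard library does not provide; a statistics package would be needed for that step.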
Circularity Check
No circularity: purely empirical performance claims
The paper presents an AI-driven autonomous testing framework with five integrated strategies and reports measured outcomes across four production applications and 176 scenarios, including script generation success rates, navigation failure reductions, race condition eliminations, test creation time savings, and security detection percentages. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. All quantitative results are framed as direct empirical observations from evaluation rather than outputs computed from inputs by construction or justified solely via self-citation chains. The central claims therefore remain independent of any internal reduction and are self-contained as reported measurements.