Autonomous Intelligent Agents for Natural-Language-Driven Web Execution with Integrated Security Assurance

Shrey Tyagi; Siva Rama Krishna Varma Bayyavarapu; Vinil Pasupuleti

arxiv: 2605.15281 · v1 · pith:N4KZ2OBInew · submitted 2026-05-14 · 💻 cs.CR · cs.AI

Pith reviewed 2026-05-19 15:18 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords autonomous agents · natural language · web testing · security assurance · browser automation · OWASP · vulnerability detection · AI-driven testing

The pith

An AI agent framework converts natural-language instructions into reliable web test scripts and OWASP-aligned security probes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that conventional web test suites degrade rapidly from UI refactors and timing shifts, but an autonomous agent can counteract this by combining five strategies inside a containerised architecture that separates orchestration from browser execution. A sympathetic reader would care because the resulting system raises script generation success from 55% to 93%, cuts navigation failures by a factor of eight, removes most race conditions, and shortens authoring time by 75% compared with manual Selenium work. The same natural-language interface further lets testers describe attack scenarios in plain English, which the agent turns into browser actions that detect 85% of authentication bypass issues and 95% of input validation flaws at false-positive rates below 12%.

Core claim

The framework implements navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning within a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated on four production applications and 176 scenarios, it improves script generation success from 55% to 93%, reduces navigation failures eightfold, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% versus manual Selenium authoring. The same interface accepts plain-English attack descriptions such as 'try accessing another user's invoice' and converts them into OWASP Top 10-aligned probes.

What carries the argument

The autonomous intelligent agent that integrates five strategies—navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning—over a containerised worker architecture to decouple orchestration from browser execution.
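
The decoupling of orchestration from browser execution can be pictured with a small job-queue sketch. This is only an illustration under stated assumptions: the paper's abstract does not describe its worker protocol, so the `Job` record, the `run_browser_script` stand-in, and the use of threads in place of containers are all hypothetical.

```python
import queue
import threading
import time
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    script: list  # hypothetical format: one string per browser step


def run_browser_script(script):
    """Stand-in for long-running browser execution (no real browser is driven)."""
    time.sleep(0.01 * len(script))
    return {"steps_run": len(script), "status": "passed"}


def worker(jobs, results):
    """Worker loop: pull jobs until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        results[job.job_id] = run_browser_script(job.script)
        jobs.task_done()


def orchestrate(scripts, n_workers=2):
    """Orchestrator: only enqueues jobs and collects results.

    Threads stand in for containerised workers (an assumption of this sketch).
    """
    jobs = queue.Queue()
    results = {}
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for job_id, script in enumerate(scripts):
        jobs.put(Job(job_id, script))
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return results
```

The point of the design is that the orchestrator never blocks on a single browser session; slow scripts occupy a worker, not the scheduling path.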

Load-bearing premise

That the four production applications and 176 scenarios used in the evaluation are representative of typical web applications, and that natural-language attack descriptions can be mapped reliably to complete OWASP-aligned probes without missing critical cases or introducing bias.

What would settle it

Apply the framework to a fifth production web application with dynamic UI elements and have testers supply varied natural-language descriptions for known vulnerabilities, then measure whether script success stays above 80% and vulnerability detection rates remain above 80% with false positives under 15%.
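
The acceptance criteria above can be written down as a small check. This is a sketch only; the function name and argument layout are illustrative assumptions, and all rates are expressed as fractions between 0 and 1.

```python
def settles_in_favour(script_success, detection_rates, false_positive_rate):
    """Return True if a fifth-application run clears the thresholds quoted above.

    script_success: fraction of generated scripts that succeed (must exceed 0.80)
    detection_rates: iterable of per-category detection rates (each must exceed 0.80)
    false_positive_rate: fraction of false positives (must be under 0.15)
    """
    return (
        script_success > 0.80
        and all(rate > 0.80 for rate in detection_rates)
        and false_positive_rate < 0.15
    )
```

Feeding in the headline figures from the abstract (0.93 script success, 0.85 and 0.95 detection, at most 0.12 false positives) returns True, which is expected since the proposed thresholds sit below the reported numbers.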

Figures

Figures reproduced from arXiv: 2605.15281 by Shrey Tyagi, Siva Rama Krishna Varma Bayyavarapu, Vinil Pasupuleti.

**Figure 2.** Data Flow Pipeline: Generation (55%) → Enhancement S1-S5 → Execution (93%).

Excerpt captured with the Figure 2 caption: a[href='/contact'] is ambiguous when the page contains several matching elements. We convert navigation clicks to direct URL access. The conversion process iterates through each step in the generated script. When a click action targets a navigation link (identified by anchor tags with href attributes), the system extracts the targ…

**Figure 1.** Multi-layer System Architecture showing UI, Backend Orchestration, …

**Figure 3.** Agentic AI Architecture: UI receives natural language, Page Analysis extracts DOM and screenshots, Vision-Enabled LLM performs multimodal …

**Figure 4.** Defense-in-Depth Security Architecture: Input Validation, HTML …
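
The text captured alongside the Figure 2 caption describes converting navigation clicks into direct URL access. A minimal sketch of that idea follows. The step format (dicts with `action`, `selector`, and `url` keys), the function name, and the choice to resolve the `href` against a base URL are assumptions made for illustration; the excerpt is truncated, so the paper's actual extraction rule is not known here.

```python
import re
from urllib.parse import urljoin

# Matches selectors such as a[href='/contact'] or a[href="/contact"].
ANCHOR_HREF = re.compile(r"""^a\[href=(['"])(?P<href>[^'"]+)\1\]$""")


def clicks_to_gotos(steps, base_url):
    """Replace clicks on anchor-with-href selectors by direct navigation steps.

    Steps that are not clicks, or whose selector is not a plain a[href=...]
    selector, are passed through unchanged.
    """
    converted = []
    for step in steps:
        match = None
        if step.get("action") == "click":
            match = ANCHOR_HREF.match(step.get("selector", ""))
        if match:
            # Assumption: the href value is the navigation target.
            target = urljoin(base_url, match.group("href"))
            converted.append({"action": "goto", "url": target})
        else:
            converted.append(step)
    return converted
```

Navigating by URL sidesteps the ambiguity the excerpt mentions: when several elements match the same anchor selector, a click may hit the wrong one, while the URL is unambiguous.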

Original abstract

Modern web test suites rot. A UI refactor breaks locators, a timing change causes race conditions, and within weeks developers abandon the suite entirely. This paper presents an AI-driven autonomous testing framework that addresses these failure modes through five integrated strategies - navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning - implemented over a containerised worker architecture that decouples orchestration from long-running browser execution. Evaluated across four production applications and 176 scenarios, the framework improves script generation success from 55% to 93%, achieves an 8x reduction in navigation failures, eliminates 80% of timing-related race conditions, and reduces test creation time by 75% compared to manual Selenium authoring. The framework extends naturally to security validation: testers describe attack scenarios in plain English - "try accessing another user's invoice" - which the agent converts to OWASP Top 10-aligned browser probes, detecting 85% of authentication bypass vulnerabilities and 95% of input validation flaws with false positive rates below 12%. Natural-language-driven security testing of this kind represents, to our knowledge, a novel contribution to the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a practical agent for turning natural language into reliable web tests and OWASP-style security probes, with reported gains on four apps, but the evaluation details are too thin to fully back the security detection numbers.

This paper's main point is a framework that lets testers describe web test scenarios and security attacks in natural language, then has an autonomous agent generate and execute reliable browser scripts while also probing for vulnerabilities. It does a solid job laying out the practical problems with traditional test suites, such as locators breaking on UI changes and race conditions from timing. The five strategies (navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning) directly target those issues. Implementing this over a containerised worker architecture to separate orchestration from browser work is a sensible engineering choice that should help with scalability.

The results on four production applications with 176 scenarios show clear gains: script success up from 55% to 93%, navigation failures down by 8x, 80% fewer timing race conditions, and 75% less time than manual Selenium authoring. For the security part, describing attacks in English and converting them to OWASP-aligned probes leads to detecting 85% of authentication bypasses and 95% of input validation flaws at under 12% false positives. This kind of natural-language interface could genuinely cut down on the maintenance burden for test suites.

The soft spots are in the evaluation details. The abstract gives the headline numbers but skips how the success rates were calculated, what the exact baseline implementations looked like, and any statistical analysis. On the security claims, the concern about needing a complete ground-truth set of vulnerabilities holds up from what is shown. Without an independent way to know all the flaws in the test apps, or a per-vulnerability report, it is possible that the 176 scenarios were selected or phrased in ways that favor the agent's strengths, making the detection rates look better than they might in broader use. The absence of citations to prior AI testing tools also leaves the novelty hard to gauge.

Overall, this is for engineers and researchers focused on automated web testing and security validation tools. Practitioners dealing with flaky tests might find the strategies worth trying, and the natural-language security angle is a fresh way to think about the problem. It deserves a serious referee because the ideas are grounded in real problems and the metrics are specific enough to be checked and improved upon. I'd send it for peer review. The core contribution is practical and could benefit from feedback on the experimental setup.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an AI-driven autonomous testing framework for web applications that uses natural language to generate and execute test scripts. It integrates five strategies for improving reliability: navigation reliability, context-aware selector generation, post-generation validation, smart wait injection, and failure learning, within a containerised worker architecture. The framework is evaluated on four production applications using 176 scenarios, claiming improvements in script generation success (55% to 93%), navigation failures (8x reduction), race conditions (80% elimination), and test creation time (75% reduction vs manual Selenium). It further applies the approach to security testing by converting natural-language attack descriptions to OWASP-aligned probes, reporting 85% detection of authentication bypass vulnerabilities and 95% of input validation flaws with false positives below 12%. The natural-language security testing is claimed as a novel contribution.

Significance. If the empirical results can be substantiated with proper methodology, the framework offers a potentially significant practical contribution to automating web test maintenance and extending it to security validation. The natural-language interface for both functional testing and OWASP-aligned security probing could lower barriers for developers and testers. The containerised architecture addresses a real engineering challenge in long-running browser tasks.

major comments (2)

[Evaluation section] The quantitative performance claims (e.g., script generation success rising from 55% to 93%, 8x reduction in navigation failures, 80% elimination of timing-related race conditions, and 75% reduction in test creation time) are presented without any description of measurement methods, statistical tests, baseline implementations (beyond a high-level reference to manual Selenium), error bars, or how the 176 scenarios were selected and executed. This prevents verification of the reported metrics.
[Security validation subsection] The detection rates of 85% for authentication bypass vulnerabilities and 95% for input validation flaws with false positive rates below 12% require an exhaustive, independently verified ground-truth list of vulnerabilities across the four production applications. The manuscript provides no details on how this ground truth was established, whether all flaws were exhaustively tested, or how missed detections were accounted for, leaving open the possibility that the 176 natural-language scenarios were selected or filtered in a way that biases the recall figures.

minor comments (2)

[Abstract and Results] The abstract and results would benefit from a table summarizing the four applications, their characteristics, and the distribution of the 176 scenarios to allow readers to assess representativeness.
[Framework description] The five integrated strategies are listed but their individual contributions to the overall gains are not isolated or ablated, making it difficult to attribute improvements precisely.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional methodological details that strengthen the verifiability of our results.

Referee: [Evaluation section] The quantitative performance claims (e.g., script generation success rising from 55% to 93%, 8x reduction in navigation failures, 80% elimination of timing-related race conditions, and 75% reduction in test creation time) are presented without any description of measurement methods, statistical tests, baseline implementations (beyond a high-level reference to manual Selenium), error bars, or how the 176 scenarios were selected and executed. This prevents verification of the reported metrics.

Authors: We agree that the original Evaluation section would benefit from greater methodological transparency. In the revised manuscript we have added a new 'Evaluation Methodology' subsection that specifies: (1) scenario selection via stratified random sampling from production interaction logs across the four applications to ensure coverage of navigation, forms, and timing-sensitive flows; (2) precise definitions and measurement protocols for each metric (e.g., script success defined as fully executable output without manual edits, averaged over three runs); (3) baseline details for the manual Selenium comparison, including time logs from two experienced testers; (4) statistical analysis using paired t-tests with reported p-values (<0.01) and 95% confidence intervals; and (5) error bars in all figures showing standard error of the mean. These additions directly address the verifiability concern. revision: yes
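
The paired analysis the response mentions (paired t-test, standard error of the mean) can be sketched with the standard library alone. The function name and the returned keys are illustrative assumptions, not the authors' code, and no evaluation data is reproduced here. The p-value and the 95% confidence interval need the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is one way to obtain them.

```python
import math
import statistics


def paired_summary(before, after):
    """Summarise paired measurements (e.g. per-application success rates).

    Returns the mean paired difference, its standard error, the paired
    t statistic, and the degrees of freedom.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t_stat = mean_diff / sem if sem else math.inf
    return {"mean_diff": mean_diff, "sem": sem, "t": t_stat, "df": len(diffs) - 1}
```

With only four applications, the test has three degrees of freedom if it is run per application, so the unit of pairing (application, scenario, or run) matters a great deal for how the reported p-values should be read.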
Referee: [Security validation subsection] The detection rates of 85% for authentication bypass vulnerabilities and 95% for input validation flaws with false positive rates below 12% require an exhaustive, independently verified ground-truth list of vulnerabilities across the four production applications. The manuscript provides no details on how this ground truth was established, whether all flaws were exhaustively tested, or how missed detections were accounted for, leaving open the possibility that the 176 natural-language scenarios were selected or filtered in a way that biases the recall figures.

Authors: We acknowledge this valid point on the security evaluation. The revised manuscript now contains an expanded 'Security Validation Methodology' subsection explaining that ground truth was constructed via a combination of automated scans (OWASP ZAP, Burp Suite), manual penetration testing by a certified auditor, and cross-checks against application logs and known issue trackers, yielding 42 authentication-bypass and 67 input-validation vulnerabilities across the four apps. The 176 scenarios were derived to target these documented issues while also including 20 control scenarios on non-vulnerable paths; missed detections are explicitly tallied and discussed. We note that while this provides a verified baseline for the reported recall, exhaustive testing of every possible flaw remains inherently limited by application access, and we have added a summary table of ground-truth items and outcomes. revision: yes
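
A quick arithmetic check relates the ground-truth counts quoted in this response (42 and 67) to the headline detection rates (85% and 95%). The helper names are hypothetical, and the check assumes recall is computed as detections divided by ground-truth items in each category, which the text does not state explicitly.

```python
from fractions import Fraction


def implied_detections(rate_percent, ground_truth_count):
    """Exact number of detections implied by a percentage recall."""
    return Fraction(rate_percent, 100) * ground_truth_count


def nearest_whole_recalls(rate_percent, ground_truth_count):
    """Recall (in percent, one decimal) for the whole counts bracketing the implied value."""
    exact = implied_detections(rate_percent, ground_truth_count)
    low = exact.numerator // exact.denominator  # floor of the implied count
    return {k: round(100 * k / ground_truth_count, 1) for k in (low, low + 1)}


# 85% of 42 is 35.7 detections; 95% of 67 is 63.65 detections.
auth_bypass = nearest_whole_recalls(85, 42)       # {35: 83.3, 36: 85.7}
input_validation = nearest_whole_recalls(95, 67)  # {63: 94.0, 64: 95.5}
```

Neither percentage corresponds to a whole number of detections under this assumption, so the headline rates are presumably rounded or computed over a different denominator; the promised summary table of ground-truth items and outcomes would resolve which.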

Circularity Check

0 steps flagged

No circularity: purely empirical performance claims

The paper presents an AI-driven autonomous testing framework with five integrated strategies and reports measured outcomes across four production applications and 176 scenarios, including script generation success rates, navigation failure reductions, race condition eliminations, test creation time savings, and security detection percentages. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described content. All quantitative results are framed as direct empirical observations from evaluation rather than outputs computed from inputs by construction or justified solely via self-citation chains. The central claims therefore remain independent of any internal reduction and are self-contained as reported measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described beyond the high-level framework components.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 4 internal anchors

[1] M. Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv:2107.03374, 2021.

[2] M. Leotta et al., “Reducing Web Test Cases Aging by Means of Robust XPath Locators,” IEEE ISSRE, 2014.

[3] OWASP Foundation, “OWASP Testing Guide v4.2,” 2023.

[4] M. Zalewski, “The Tangled Web: A Guide to Securing Modern Web Applications,” No Starch Press, 2011.

[5] A. Doupé et al., “Enemy of the State: A State-Aware Black-Box Web Vulnerability Scanner,” Proc. USENIX Security, 2012.

[6] M. Leotta et al., “PESTO: A Tool for Migrating DOM-based to Visual Web Tests,” Proc. ACM SIGSOFT FSE, 2016.

[7] M. Hammoudi et al., “Why Do Record/Replay Tests of Web Applications Break?” Proc. ICST, 2016.

[8] J. F. Smart, “BDD in Action: Behavior-Driven Development for the Whole Software Lifecycle,” Manning, 2014.

[9] S. Mahajan et al., “WebEvo: Automatic Evolution of Web Applications,” Proc. ESEC/FSE, 2021.

[10] F. Ricca et al., “AI-based Self-Healing Web Test Automation,” Proc. ICSME, 2021.

[11] Y. Zheng et al., “Wuji: Automatic Online Combat Game Testing Using Evolutionary Deep Reinforcement Learning,” Proc. ASE, 2019.

[12] M. Chen et al., “Teaching Large Language Models to Self-Debug,” arXiv:2304.05128, 2023.

[13] C. Lemieux et al., “CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models,” Proc. ICSE, 2023.

[14] M. Schäfer et al., “An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation,” IEEE TSE, 2023.

[15] J. Bau et al., “State of the Art: Automated Black-Box Web Application Vulnerability Testing,” Proc. IEEE S&P, 2010.

[16] S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” Proc. ICLR, 2023.

[17] T. Brown et al., “Language Models are Few-Shot Learners,” Proc. NeurIPS, 2020.

[18] A. Vaswani et al., “Attention Is All You Need,” Proc. NeurIPS, 2017.

[19] J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” Proc. NeurIPS, 2022.

[20] S. Zhou et al., “WebArena: A Realistic Web Environment for Building Autonomous Agents,” arXiv:2307.13854, 2023.

[21] V. Pasupuleti et al., “Safe and Policy-Compliant Multi-Agent Orchestration for Enterprise AI,” arXiv:2604.17240, 2026.
