WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Pith reviewed 2026-05-15 22:18 UTC · model grok-4.3
The pith
The WASP benchmark shows top web agents deceived by simple prompt injections, with partial attack success in up to 86 percent of cases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WASP provides a publicly available benchmark for end-to-end evaluation of web agent security against prompt injection attacks. Evaluating leading models with it shows that simple, low-effort, human-written injections deceive agents in realistic scenarios. Attacks achieve partial success in up to 86 percent of cases, while state-of-the-art agents often struggle to fully complete the attacker's goals.
What carries the argument
The WASP benchmark, which supplies realistic multi-step web tasks and human-written injection examples to measure how far agents progress toward attacker objectives.
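The scoring machinery is not reproduced on this page; as a hedged sketch, measuring "how far agents progress toward attacker objectives" can be modeled by checking which attacker sub-goals a trajectory satisfies. The trajectory format and sub-goal predicates below are hypothetical, not WASP's actual harness:

```python
# Hypothetical sketch: classify a web-agent trajectory as NONE / PARTIAL / FULL
# against an attacker objective modeled as a list of sub-goal predicates.
# The step strings and predicates are illustrative, not WASP's actual code.

def attack_outcome(trajectory, attacker_subgoals):
    """trajectory: list of action strings; attacker_subgoals: list of predicates."""
    satisfied = sum(
        1 for goal in attacker_subgoals
        if any(goal(step) for step in trajectory)
    )
    if satisfied == 0:
        return "NONE"
    if satisfied < len(attacker_subgoals):
        return "PARTIAL"   # agent was deceived but did not finish the attack
    return "FULL"          # every attacker sub-goal was completed

# Example: an injection luring the agent to an attacker-controlled issue page,
# then asking it to exfiltrate a credential in a comment.
subgoals = [
    lambda step: "navigate" in step and "attacker-issue" in step,  # followed the lure
    lambda step: "comment" in step and "password" in step,         # exfiltrated data
]
steps = ["navigate attacker-issue", "read page", "comment: working on it"]
print(attack_outcome(steps, subgoals))  # prints "PARTIAL": lured, but no exfiltration
```

Under this kind of rubric, the paper's headline split falls out naturally: partial success (the agent takes some attacker-directed step) is far easier to trigger than full completion of every sub-goal.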
If this is right
- Simple prompt injections can deceive even top-tier models with advanced reasoning in realistic web tasks.
- Partial attack success reaches up to 86 percent across evaluated scenarios.
- Agents rarely finish the full attacker goals even when partially deceived.
- Security in current web agents depends more on incomplete execution of malicious instructions than on built-in protections.
Where Pith is reading between the lines
- Greater agent reasoning ability could raise the risk of full attacker goal completion if vulnerabilities persist.
- Real deployments for tasks like financial actions may need extra layers of oversight beyond model behavior.
- Expanding WASP with live user sessions could test whether benchmark results predict actual attack outcomes.
Load-bearing premise
The benchmark tasks and injection examples accurately represent real-world web agent usage and attacker capabilities without over-simplifying or granting attackers unrealistic control.
What would settle it
Finding that agents in actual deployed settings complete attacker goals at much higher rates than the partial success observed in WASP tests would challenge the claim that security stems mainly from agent incompetence.
read the original abstract
Autonomous UI agents powered by AI have tremendous potential to boost human productivity by automating routine tasks such as filing taxes and paying bills. However, a major challenge in unlocking their full potential is security, which is exacerbated by the agent's ability to take action on their user's behalf. Existing tests for prompt injections in web agents either over-simplify the threat by testing unrealistic scenarios or giving the attacker too much power, or look at single-step isolated tasks. To more accurately measure progress for secure web agents, we introduce WASP -- a new publicly available benchmark for end-to-end evaluation of Web Agent Security against Prompt injection attacks. Evaluating with WASP shows that even top-tier AI models, including those with advanced reasoning capabilities, can be deceived by simple, low-effort human-written injections in very realistic scenarios. Our end-to-end evaluation reveals a previously unobserved insight: while attacks partially succeed in up to 86% of the case, even state-of-the-art agents often struggle to fully complete the attacker goals -- highlighting the current state of security by incompetence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WASP, a publicly available benchmark for end-to-end evaluation of web agents against prompt injection attacks in realistic multi-step scenarios. It evaluates top-tier models and reports that simple human-written injections achieve partial success in up to 86% of cases, while agents rarely fully complete attacker goals, which the authors interpret as evidence of 'security by incompetence' in current systems.
Significance. If the benchmark tasks and injection examples are representative of real-world usage and the evaluation includes proper controls, WASP would offer a valuable standardized tool for measuring progress toward secure web agents and expose practical vulnerabilities in deployed models.
major comments (2)
- [Abstract and evaluation results] The central interpretation in the abstract—that low full completion of attacker goals demonstrates 'security by incompetence'—is not supported without clean-task baseline success rates. The manuscript must report the agents' success rates on the identical tasks with no injections present; absent this, the observed partial successes (up to 86%) and incomplete goals could simply reflect general unreliability on complex multi-step web tasks rather than any security property.
- [Evaluation section] The abstract and results claim concrete outcomes (86% partial success) but the provided text lacks sufficient detail on task definitions, success metrics, statistical significance, number of trials, and inter-rater or automated verification procedures. These elements are load-bearing for the empirical claims and must be expanded with explicit methodology.
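The referee's first point can be made concrete: how a low attacker-goal completion rate should be read depends entirely on the clean-task baseline the manuscript omits. A minimal sketch of that comparison (the rates and the 0.5 reliability threshold are invented for illustration):

```python
# Illustrative only: interpreting a low attacker-goal completion rate requires
# the agent's success rate on the same tasks with no injection present.
# The 0.5 reliability cut below is an arbitrary threshold for the sketch.

def interpret(full_attack_rate, clean_success_rate):
    """Crude three-way reading of attack completion vs. the clean-task baseline."""
    if full_attack_rate >= clean_success_rate:
        return "no evidence of security: attacks complete as often as benign tasks"
    if clean_success_rate < 0.5:
        return "ambiguous: agent is generally unreliable on these tasks"
    return "gap suggests resistance beyond baseline unreliability"

# If the agent finishes benign multi-step tasks only 25% of the time,
# a 10% attacker-goal completion rate says little about security.
print(interpret(0.10, 0.25))  # prints the "ambiguous" reading
```

Only the third branch would license a "security" claim, and only the clean baseline can distinguish it from the second.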
minor comments (2)
- [Abstract] The phrase '86% of the case' in the abstract should be corrected to '86% of the cases'.
- [Evaluation] Clarify the exact criteria used to distinguish 'partial success' from 'full completion' of attacker goals, including any scoring rubric or examples.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our WASP benchmark paper. We agree that the current manuscript requires additional baselines and expanded methodological details to fully support the empirical claims and interpretations. We will perform the necessary revisions, including new experiments for clean-task baselines and substantial expansions to the evaluation section, to address both major comments.
read point-by-point responses
-
Referee: [Abstract and evaluation results] The central interpretation in the abstract—that low full completion of attacker goals demonstrates 'security by incompetence'—is not supported without clean-task baseline success rates. The manuscript must report the agents' success rates on the identical tasks with no injections present; absent this, the observed partial successes (up to 86%) and incomplete goals could simply reflect general unreliability on complex multi-step web tasks rather than any security property.
Authors: We acknowledge that the 'security by incompetence' framing in the abstract is not fully supported without clean baselines, as the referee correctly notes. The observed partial successes and incomplete attacker goals could indeed partly reflect baseline task difficulty. In the revised manuscript, we will add success rates for all agents on the identical WASP tasks with no injections present. This will enable direct comparison, allowing us to quantify the incremental effect of injections versus general unreliability. We will update the abstract, results, and discussion sections to reflect these new data and revise the interpretation accordingly. revision: yes
-
Referee: [Evaluation section] The abstract and results claim concrete outcomes (86% partial success) but the provided text lacks sufficient detail on task definitions, success metrics, statistical significance, number of trials, and inter-rater or automated verification procedures. These elements are load-bearing for the empirical claims and must be expanded with explicit methodology.
Authors: We agree that the evaluation methodology section is insufficiently detailed to support the reported results. In the revision, we will substantially expand this section to include: precise definitions of each task and sub-goal; exact criteria for partial vs. full success (including how partial success is quantified); number of independent trials per agent-task pair; statistical significance testing; and full details on verification procedures (automated success checks supplemented by human review where needed). We will also add pointers to the public benchmark code and data for full reproducibility. revision: yes
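On the significance testing the rebuttal promises, one standard choice for comparing attack-success proportions between two conditions over independent trials is a two-proportion z-test. The sketch below uses invented counts; the revised paper may instead opt for exact tests or bootstrap intervals:

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Normal-approximation z-statistic for a difference in success proportions
    between two conditions over independent trials. Illustrative; counts are invented."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Example: 86/100 partial successes undefended vs. 60/100 under a candidate defense.
z = two_proportion_z(86, 100, 60, 100)
print(round(z, 2))  # prints 4.14; |z| > 1.96 is significant at the 5% level
```

With per-trial counts reported, readers could run this comparison themselves for any agent pair.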
Circularity Check
No circularity: empirical benchmark results are direct measurements
full rationale
The paper introduces the WASP benchmark and reports empirical attack success rates from running it on existing agents. No derivations, equations, fitted parameters renamed as predictions, or self-citations that reduce the central claims to tautologies appear. The 86% partial success and incomplete goal completion figures are presented as observed outcomes from the new tasks, not forced by construction from prior inputs or definitions. The skeptic concern about missing clean baselines is a question of experimental design completeness, not circularity in the reported chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Simulated web environments can faithfully represent real-world agent interactions for security testing
Forward citations
Cited by 19 Pith papers
-
IPI-proxy: An Intercepting Proxy for Red-Teaming Web-Browsing AI Agents Against Indirect Prompt Injection
IPI-proxy is a toolkit using an intercepting proxy to inject indirect prompt injection attacks into live web pages for testing AI browsing agents against hidden instructions.
-
The Granularity Mismatch in Agent Security: Argument-Level Provenance Solves Enforcement and Isolates the LLM Reasoning Bottleneck
PACT achieves perfect security and utility under oracle provenance by enforcing argument-level trust contracts based on semantic roles and cross-step provenance tracking, outperforming invocation-level monitors in Age...
-
WAAA! Web Adversaries Against Agentic Browsers
Agentic browsers are vulnerable to 20 web and LLM attacks with 18 implemented, exposing five failure modes across four major LLM models that require redesign before safe deployment.
-
Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives
Large-scale analysis of 1.2B URLs identifies 15.3K indirect prompt injection instances in the wild, mostly targeting AI systems with up to 8% compliance in model experiments.
-
OS-SPEAR: A Toolkit for the Safety, Performance,Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
-
Web Agents Should Adopt the Plan-Then-Execute Paradigm
Web agents should default to planning a complete task program before observing live web content to reduce prompt injection exposure, since WebArena tasks are compatible and 80% need no runtime LLM calls.
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
EnvTrustBench benchmarks evidence-grounding defects in LLM agents and finds they occur consistently across workflows.
-
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
-
SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents
SnapGuard detects prompt injection attacks on screenshot-based web agents via visual stability indicators and contrast-polarity textual signals, reaching F1 0.75 while running 8x faster than GPT-4o with no added memory cost.
-
HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark
HINTBench provides 629 annotated agent trajectories to audit intrinsic non-attack risks, showing that strong LLMs detect overall risk but fail at localizing specific risky steps or diagnosing failure types.
-
Measuring the Permission Gate: A Stress-Test Evaluation of Claude Code's Auto Mode
Independent evaluation of Claude Code auto mode finds 81% false negative rate on ambiguous authorization tasks due to unmonitored file edits.
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
-
Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems
Safety constraints in LLM-based multi-agent systems commonly weaken during execution through memory, communication, and tool use, requiring them to be maintained as explicit state rather than asserted once.
-
Constraining Host-Level Abuse in Self-Hosted Computer-Use Agents via TEE-Backed Isolation
A TEE-backed architecture isolates security-critical decisions in self-hosted AI agents to prevent host-level abuse from malicious inputs while maintaining allowed functionality.
-
WebAgentGuard: A Reasoning-Driven Guard Model for Detecting Prompt Injection Attacks in Web Agents
WebAgentGuard is a reasoning-driven multimodal model trained on large synthetic data via supervised fine-tuning and reinforcement learning to detect prompt injections in web agents better than prior defenses.
-
PIArena: A Platform for Prompt Injection Evaluation
PIArena provides a unified evaluation platform for prompt injection attacks and defenses, featuring a new adaptive attack that reveals major weaknesses in existing protections.
-
Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability
The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...
-
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...
discussion (0)