WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
Cowpilot: a framework for autonomous and human-agent collaborative web navigation
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CR 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.