WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
- DeepSlide: From Artifacts to Presentation Delivery
DeepSlide introduces a multi-agent system for full presentation preparation that matches baselines on slide quality but improves narrative flow, pacing, and script synergy via a new dual-scoreboard benchmark.