3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3). Verdicts: UNVERDICTED (3).
citing papers explorer
- InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
  InteractWeb-Bench shows that frontier multimodal AI agents remain trapped in blind execution when generating websites from perturbed, low-quality non-expert instructions.
- WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks
  WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.
- Learning to Learn from Multimodal Experience
  Agents learn to dynamically construct and organize memory from multimodal experiences, improving performance over static designs in task-dependent settings.