InteractWeb-Bench shows that frontier multimodal AI agents remain trapped in blind execution when generating websites from perturbed, low-quality non-expert instructions.
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
citing papers explorer
- InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
  InteractWeb-Bench shows that frontier multimodal AI agents remain trapped in blind execution when generating websites from perturbed, low-quality non-expert instructions.
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
  WebVoyager uses a large multimodal model to complete real-world web tasks end-to-end and reaches 59.1 percent success on a new benchmark of 15 live sites, with an automatic GPT-4V evaluator that matches human judgments 85 percent of the time.
- Learning to Learn from Multimodal Experience
  Agents learn to dynamically construct and organize memory from multimodal experiences, improving performance over static designs in task-dependent settings.