InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
Pith reviewed 2026-05-07 09:11 UTC · model grok-4.3
The pith
Frontier multimodal agents remain trapped in blind execution when generating websites from ambiguous user instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that frontier MLLM-based agents remain trapped in blind execution, a state in which they fail to resolve ambiguities or contradictions in user instructions through iterative clarification and visual verification, even when provided with an interactive environment and a unified action space.
What carries the argument
The InteractWeb-Bench interactive execution environment, which features a unified action space (Clarify, Implement, Verify, Submit), together with four simulated user-agent types that generate persona-driven instruction perturbations drawn from requirement-engineering defect taxonomies.
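The action space and interaction loop described above can be sketched as follows. The four action names come from the paper; everything else here (the loop structure, the `agent`/`user`/`env` interfaces, and all method names) is a hypothetical illustration of how such an environment could be driven, not the paper's actual implementation.

```python
from enum import Enum, auto

class Action(Enum):
    CLARIFY = auto()    # ask the simulated user a question about intent
    IMPLEMENT = auto()  # emit or revise website code
    VERIFY = auto()     # render the site and inspect visual feedback
    SUBMIT = auto()     # finalize the implementation; ends the episode

def run_episode(agent, user, env, max_steps=20):
    """Drive one interactive session until Submit or the step budget runs out."""
    observation = user.initial_instruction()
    for _ in range(max_steps):
        action, payload = agent.decide(observation)
        if action is Action.CLARIFY:
            observation = user.answer(payload)      # simulated user's reply
        elif action is Action.IMPLEMENT:
            observation = env.apply_code(payload)   # build/update the site
        elif action is Action.VERIFY:
            observation = env.render_screenshot()   # visual feedback signal
        elif action is Action.SUBMIT:
            break
    return env.final_artifact()
```

A blindly executing agent, in these terms, is one whose trajectories jump straight from `IMPLEMENT` to `SUBMIT` without ever selecting `CLARIFY` or `VERIFY`.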
If this is right
- Current MLLM agents exhibit clear limitations in intent recognition that prevent effective use of clarification and verification actions.
- Interactive feedback mechanisms alone do not enable agents to escape blind execution under ambiguous or contradictory instructions.
- Website generation by agents will require advances in adaptive interaction to handle real-world non-expert user inputs.
- Benchmarks must move beyond idealized, well-structured inputs to reflect actual development constraints.
Where Pith is reading between the lines
- The same simulation approach of perturbed instructions could be extended to other interactive agent tasks such as data visualization or mobile app creation to test robustness more broadly.
- Models trained or fine-tuned specifically on defect-based perturbations might show measurable gains in intent alignment on this benchmark.
- The results point toward the need for hybrid workflows where human oversight handles initial ambiguity resolution until agent intent recognition improves.
Load-bearing premise
The four simulated user agent types and their persona-driven instruction perturbations accurately represent the semantic misalignment that occurs between ambiguous non-expert instructions and model understanding in real website development.
What would settle it
A clear falsifier would be if frontier agents in the benchmark environment consistently use the Clarify action to resolve ambiguities, then produce implementations that pass Verify checks and match the original user intent across multiple perturbed sessions.
Original abstract
With the advancement of multimodal large language models (MLLMs) and coding agents, the website development has shifted from manual programming to agent-based project-level code synthesis. Existing benchmarks rely on idealized assumptions, especially for well-structured, information-rich inputs and static execution settings. In contrast, real-world development is constrained by a critical bottleneck: the semantic misalignment between ambiguous, low-quality instructions from non-expert users and model understanding, which results in a failure mode that we term blind execution. To address this gap, we introduce InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code user conditions. InteractWeb-Bench introduces four types of user agents and persona-driven instruction perturbations to systematically simulate diverse user behaviors, including ambiguity, redundancy, and contradiction, grounded in requirement engineering defect taxonomies. We develop an interactive execution environment for agents, featuring a unified action space comprising Clarify, Implement, Verify, and Submit, enabling iterative intent refinement, code synthesis, and visual feedback-based validation. Extensive experiments and analysis reveal that frontier MLLM-based agents remain trapped in blind execution, exposing limitations in intent recognition and adaptive interaction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces InteractWeb-Bench, the first multimodal interactive benchmark for website generation under non-expert low-code conditions. It defines four user-agent types and persona-driven instruction perturbations (grounded in requirement-engineering defect taxonomies) to simulate ambiguity, redundancy, and contradiction. The benchmark supplies an interactive execution environment with a unified action space (Clarify, Implement, Verify, Submit) that supports iterative intent refinement, code synthesis, and visual-feedback validation. Experiments on frontier MLLM-based agents conclude that they remain trapped in a failure mode termed 'blind execution,' exposing limitations in intent recognition and adaptive interaction.
Significance. If the benchmark's perturbations are shown to faithfully reproduce real-world semantic misalignment, the work would usefully identify a practical limitation of current multimodal agents in interactive coding tasks and supply a new evaluation framework that departs from static, information-rich benchmarks. The introduction of an interactive action space and the grounding in established defect taxonomies are constructive elements.
major comments (1)
- [InteractWeb-Bench construction and user-agent definition] The central claim that frontier agents are 'trapped in blind execution' as a general limitation rests on the assertion that the four user-agent types and persona-driven perturbations accurately simulate semantic misalignment between ambiguous non-expert instructions and model understanding. No external validation is reported (e.g., comparison against a corpus of real non-expert developer queries, inter-rater agreement with practicing web developers, or an ablation demonstrating that removing any perturbation type alters agent behavior in a manner matching observed real-world failure modes). This is load-bearing for the headline result.
minor comments (2)
- [Abstract] The abstract states that 'extensive experiments and analysis reveal' the trapping result but supplies no quantitative metrics, model names, success rates, baseline comparisons, or error breakdowns; readers must reach the experimental section to evaluate the strength of the evidence.
- [Introduction / Benchmark overview] The term 'blind execution' is introduced as a new failure mode; a concise operational definition (e.g., percentage of submissions without clarification requests despite detectable ambiguity) would improve clarity.
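The operational definition suggested in the second minor comment can be computed directly from session logs. The sketch below assumes a simple hypothetical log schema (the field names `has_seeded_ambiguity` and `actions` are illustrative, not from the paper): because the benchmark seeds its perturbations, every ambiguity is detectable by construction, so the metric reduces to counting sessions that never issued a Clarify.

```python
def blind_execution_rate(sessions):
    """Fraction of ambiguity-seeded sessions in which the agent never
    issued a Clarify action before submitting.

    Each session dict is assumed to have:
      'has_seeded_ambiguity': bool  -- a perturbation was injected
      'actions': list[str]          -- ordered action names for the episode
    """
    ambiguous = [s for s in sessions if s["has_seeded_ambiguity"]]
    if not ambiguous:
        return 0.0
    blind = sum(1 for s in ambiguous if "Clarify" not in s["actions"])
    return blind / len(ambiguous)
```

A rate near 1.0 under this definition would quantify the paper's qualitative "trapped in blind execution" claim.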
Simulated Author's Rebuttal
We thank the referee for the constructive review and for identifying a key aspect of our benchmark construction. We address the major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: [InteractWeb-Bench construction and user-agent definition] The central claim that frontier agents are 'trapped in blind execution' as a general limitation rests on the assertion that the four user-agent types and persona-driven perturbations accurately simulate semantic misalignment between ambiguous non-expert instructions and model understanding. No external validation is reported (e.g., comparison against a corpus of real non-expert developer queries, inter-rater agreement with practicing web developers, or an ablation demonstrating that removing any perturbation type alters agent behavior in a manner matching observed real-world failure modes). This is load-bearing for the headline result.
Authors: We agree that external validation would strengthen the generalizability of the headline claim. The four user-agent types and perturbations were systematically derived from established requirement-engineering defect taxonomies (covering ambiguity, redundancy, and contradiction) rather than being ad hoc; this grounding is described in Section 3 of the manuscript. However, the current version does not include direct comparisons to real non-expert query corpora, inter-rater agreement metrics, or explicit ablation results on individual perturbation categories. In the revised manuscript we will (1) expand Section 3 to provide a more detailed mapping from each taxonomy category to the implemented perturbation rules, (2) add a dedicated Limitations subsection that explicitly acknowledges the absence of real-world corpus validation and inter-rater studies, and (3) include a brief discussion of how future work could perform such validation. We maintain that the controlled, taxonomy-grounded simulation still yields reproducible evidence of blind execution across frontier agents, but we will qualify the scope of the general-limitation claim accordingly. (Revision: partial)
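The mapping from taxonomy category to perturbation rule that the rebuttal promises to expand can be pictured as a small dispatch table. The three category names (ambiguity, redundancy, contradiction) come from the abstract; the rule bodies below are purely illustrative placeholders, not the paper's actual perturbation rules.

```python
# Hypothetical perturbation rules keyed by defect-taxonomy category.
# The paper grounds its categories in requirement-engineering defect
# taxonomies; these rule implementations are illustrative only.

def make_ambiguous(instr):
    # Replace a concrete value with a vague qualifier.
    return instr.replace("three columns", "a few columns")

def add_redundancy(instr):
    # Restate a requirement that is already present.
    return instr + " Also, make sure the page has columns."

def add_contradiction(instr):
    # Append a requirement that conflicts with an earlier one.
    return instr + " Use a single-column layout."

PERTURBATIONS = {
    "ambiguity": make_ambiguous,
    "redundancy": add_redundancy,
    "contradiction": add_contradiction,
}

def perturb(instr, category):
    """Apply the perturbation rule for the given defect category."""
    return PERTURBATIONS[category](instr)
```

An ablation of the kind the referee requests would then drop one key from `PERTURBATIONS` at a time and compare agent behavior across the resulting benchmark variants.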
Circularity Check
No circularity: pure empirical benchmark with no derivations or self-referential fits
Full rationale
The paper introduces InteractWeb-Bench as an empirical evaluation framework for MLLM agents in website generation. It defines four user-agent types and persona-driven perturbations grounded in external requirement-engineering taxonomies, then runs experiments in a custom interactive environment with a fixed action space. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claim (agents remain in blind execution) is an observed outcome of the benchmark runs rather than a quantity forced by construction from the benchmark definition itself. The absence of external validation for ecological validity is a separate methodological concern, not an instance of circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Persona-driven instruction perturbations grounded in requirement-engineering defect taxonomies can systematically simulate real-world user ambiguity, redundancy, and contradiction.
invented entities (1)
- blind execution: no independent evidence