arxiv: 2604.18224 · v1 · submitted 2026-04-20 · 💻 cs.SE · cs.AI

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Pith reviewed 2026-05-10 04:36 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords web coding evaluation · multimodal benchmark · code language models · web engineering · LLM judge · agent judge · interactive web applications

The pith

WebCompass provides a multimodal benchmark for evaluating code language models on full web engineering lifecycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current benchmarks for code LLMs in web coding assess only narrow slices of the task, typically text-conditioned generation with static correctness checks. This leaves out how the output looks, how it responds to user interaction, and how well a model can edit or repair an existing codebase. WebCompass fills this gap with a benchmark that takes text, image, and video inputs for tasks that generate, edit, and repair web code. It organizes these into seven categories that reflect real professional web development processes. Generated sites are evaluated by an agent that runs the code in a real browser and tests it automatically; edits and repairs are scored by a checklist-guided LLM judge.

Core claim

We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. It spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Instances cover 15 generation domains, 16 editing operation types, and 11 repair defect types, each at Easy/Medium/Hard levels. Evaluation adopts checklist-guided LLM-as-a-Judge for editing and repair, and Agent-as-a-Judge for generation that executes sites in a real browser, explores behaviors, and synthesizes test cases. Evaluations of representative models show that closed-source models are stronger and more balanced.
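
Three modalities times three task types would give nine combinations, yet the benchmark has seven categories. The sketch below makes that arithmetic explicit using the category names from the Figure 1 legend. The assignment of each name to a (modality, task type) pair is an assumption read off the names themselves, not something the page states.

```python
from itertools import product

MODALITIES = ("text", "image", "video")
TASK_TYPES = ("generation", "editing", "repair")

# Category names come from the Figure 1 legend. Which (modality, task type)
# pair each name belongs to is an assumption inferred from the name.
CATEGORIES = {
    ("text", "generation"): "Text-Guided Generation",
    ("image", "generation"): "Vision-Guided Generation",
    ("video", "generation"): "Video-Guided Generation",
    ("text", "editing"): "Text-Guided Editing",
    ("image", "editing"): "Vision-Guided Editing",
    ("text", "repair"): "Diagnostic Repair",
    ("image", "repair"): "Visual-Diagnostic Repair",
}


def unused_pairs():
    """Return the modality/task combinations that have no category."""
    return [p for p in product(MODALITIES, TASK_TYPES) if p not in CATEGORIES]


print(len(CATEGORIES))   # 7
print(unused_pairs())    # [('video', 'editing'), ('video', 'repair')]
```

Under this reading, video input is used only for generation, which accounts for the two missing cells of the 3 x 3 grid.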

What carries the argument

The WebCompass benchmark with its seven task categories across modalities and the Agent-as-a-Judge protocol that autonomously tests generated websites in a browser using the Model Context Protocol to approximate human acceptance testing.
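
A rough sketch of how such a judge loop might be organized, assuming the agent works from a list of checkable items and records a score and a reason for each. The `ChecklistItem` fields, `run_judge`, and `stub_verify` are all hypothetical names for illustration; the stub stands in for real browser interaction and makes no MCP or browser calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ChecklistItem:
    task: str
    max_score: int
    score: Optional[int] = None  # None means "not yet verified"
    reason: str = ""


# Hypothetical verifier: inspects one item and returns (score, reason).
Verifier = Callable[[ChecklistItem], Tuple[int, str]]


def run_judge(items: List[ChecklistItem], verify: Verifier) -> float:
    """Score every unverified item, then return the total as a percentage."""
    for item in items:
        if item.score is not None:
            continue  # already verified, leave as is
        score, reason = verify(item)
        item.score = max(0, min(score, item.max_score))  # clamp to valid range
        item.reason = reason
    total = sum(i.max_score for i in items)
    earned = sum(i.score for i in items)
    return 100.0 * earned / total if total else 0.0


def stub_verify(item: ChecklistItem) -> Tuple[int, str]:
    # Stand-in for driving a browser: passes only items mentioning "loads".
    ok = "loads" in item.task
    return (item.max_score if ok else 0, "stubbed observation")


items = [
    ChecklistItem("Page loads correctly", 10),
    ChecklistItem("Dark theme applied", 5),
]
print(round(run_judge(items, stub_verify), 1))
```

In a real setup the verifier would perform the clicks, typing, and screenshots in the browser; the loop structure is the only part this sketch tries to convey.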

If this is right

  • Closed-source models remain substantially stronger and more balanced in web coding tasks compared to open-source models.
  • Editing and repair tasks show different difficulty profiles, with repair better preserving interactivity but being more challenging in execution.
  • Aesthetics is the most persistent bottleneck, particularly for open-source models.
  • Framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on the task type.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained with this benchmark in mind could develop better capabilities for iterative web development processes.
  • The Agent-as-a-Judge approach might be extended to evaluate other interactive software outputs beyond websites.
  • Future benchmarks in software engineering could adopt similar multimodal and lifecycle-spanning designs to better match real-world use.

Load-bearing premise

The LLM-as-a-Judge and Agent-as-a-Judge evaluation methods accurately measure quality in ways that match what human experts would accept in professional web engineering.

What would settle it

Conducting a study with human web developers rating the same set of model outputs and finding low agreement with the automated judge scores would challenge the benchmark's validity.
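
One way such a check could be quantified is a rank correlation between human ratings and judge scores for the same outputs. The following is a minimal pure-Python sketch of Spearman's rho with average ranks for ties. The ratings are invented for illustration and are not data from the paper.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example: five outputs rated by humans (1-5) and by the judge (0-100).
human = [4.5, 3.0, 2.0, 4.0, 1.5]
judge = [88, 61, 70, 79, 35]
print(round(spearman(human, judge), 3))  # 0.9
```

A value near 1 would mean the judge orders outputs much as humans do; a value near 0 would be the "low agreement" outcome described above.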

Figures

Figures reproduced from arXiv: 2604.18224 by Chenchen Zhang, Chenyu Zhou, Dailin Li, Han Li, Haoyang Huang, Hongyi Ye, Jiaheng Liu, Jinhua Hao, Junqi Xiong, Ken Deng, Letian Zhu, Minghao Liu, Ming Sun, Xinping Lei, Xinyu Che, Yifan Yao, Yukai Huang, Zhaoxiang Zhang, Zizheng Zhan.

Figure 1: Radar chart of model performance across all seven task types in WebCompass (Text-Guided Generation, Vision-Guided Generation, Video-Guided Generation, Text-Guided Editing, Vision-Guided Editing, Diagnostic Repair, Visual-Diagnostic Repair). An accompanying bar chart shows the number of tasks per category, split by Easy/Medium/Hard.
Figure 3: Overview of WebCompass. The benchmark supports three input modalities (text, image, video)…
Figure 4: Data construction pipeline for WebCompass.
Figure 5: Illustration of the LLM-as-a-Judge evaluation pipeline.
Figure 6: Agent-as-a-Judge evaluation pipeline. The MCP bridge enables bidirectional communication:…
Figure 7: Comparison of model rankings between agent-based automatic evaluation and human eval…
Figure 8: Overall scores across front-end frameworks for four representative models on Generation,…
Figure 9: Overall score breakdown for editing tasks across 16 operation types. Scores are computed as the…
Figure 10: Overall score breakdown for repair tasks across 11 defect categories. Scores are computed…
Figure 11: Performance comparison across Generation, Editing, and Repair tasks by difficulty level.
Figure 12: Generation task: per-dimension scores (Runnability, Spec Implementation, and Design Quality)…
Figure 13: Edit task: per-dimension scores (Instruction Targeting, Feature Integrity, and Style Confor…
Figure 14: Repair task: per-dimension scores (Root-Cause Targeting, Interaction Integrity, and Reference…
Figure 15: Distribution of patch complexity across models. Top row: Edit tasks; bottom row: Repair…
Figure 16: Consistency & Stability: Score Degradation under Worst-of-N
Figure 17: Overall error distribution across all evaluated models on web generation tasks. Feature…
Figure 18: Error distribution by input modality. Text-conditioned generation is dominated by functional…
Figure 19: Quantitative distribution of error types for…
Figure 20: Quantitative distribution of error types for…

Original abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WebCompass, a multimodal benchmark for unified lifecycle evaluation of code LLMs on web engineering. It spans three input modalities (text/image/video) and three task types (generation/editing/repair) to produce seven categories mirroring professional workflows. A human-in-the-loop pipeline curates instances across 15 generation domains, 16 editing operations, and 11 repair defects, each labeled Easy/Medium/Hard. Evaluation employs a checklist-guided LLM-as-a-Judge for editing/repair and a novel Agent-as-a-Judge for generation that executes sites in a browser, uses MCP to explore interactions, and synthesizes tests. Experiments on closed- and open-source models yield four observations: closed-source models are stronger and more balanced; editing and repair show distinct profiles; aesthetics remains the main bottleneck; and framework choice (e.g., Vue vs. React/Vanilla) affects results.

Significance. If the judge protocols are validated against human raters, WebCompass would provide a valuable advance over existing narrow web-coding benchmarks by measuring visual fidelity, interactivity, and iterative repair in addition to static correctness. The Agent-as-a-Judge paradigm, which autonomously generates targeted tests via real-browser execution, is a concrete methodological contribution that could be reused in other agentic coding evaluations. The human-curated coverage of domains and defect types also supplies a reusable resource for the community.

major comments (2)
  1. [Evaluation protocols] Evaluation section (Agent-as-a-Judge and LLM-as-a-Judge protocols): the claim that these protocols 'closely approximate human acceptance testing' is unsupported by any reported inter-rater agreement, expert correlation, or validation-subset results against human engineers. This directly undermines interpretation of all four experimental observations, especially claims about aesthetics as the persistent bottleneck and differences in interactivity preservation between editing and repair.
  2. [Benchmark curation] Benchmark construction and curation pipeline: no quantitative details are given on how data exclusions were decided, what inter-rater agreement was achieved during human annotation of Easy/Medium/Hard levels, or how many instances were discarded. These omissions affect the reliability of the 15/16/11 category counts and the difficulty stratification used to support the difficulty-profile findings.
minor comments (2)
  1. [Results] The abstract and results section should include at least one table or figure that reports per-model scores broken down by the seven task categories rather than only high-level aggregates, to allow readers to verify the 'distinct difficulty profiles' claim.
  2. [Evaluation protocols] Notation for the Model Context Protocol (MCP) and the exact checklist items used by the LLM-as-a-Judge should be defined in a dedicated subsection or appendix so that the evaluation can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and positive assessment of WebCompass's potential contributions. We address the major comments point-by-point below, proposing specific revisions to strengthen the manuscript.

  1. Referee: [Evaluation protocols] Evaluation section (Agent-as-a-Judge and LLM-as-a-Judge protocols): the claim that these protocols 'closely approximate human acceptance testing' is unsupported by any reported inter-rater agreement, expert correlation, or validation-subset results against human engineers. This directly undermines interpretation of all four experimental observations, especially claims about aesthetics as the persistent bottleneck and differences in interactivity preservation between editing and repair.

    Authors: We agree that explicit validation against human raters would strengthen the claims. The protocols were designed to approximate human acceptance testing through structured checklists for LLM-as-a-Judge and real-browser execution with automated test synthesis for Agent-as-a-Judge. However, the manuscript does not include quantitative agreement metrics. In the revision, we will add a new subsection reporting results from a human validation study on a representative subset (e.g., 100 instances), including inter-rater agreement (Cohen's kappa) between the judges and expert human engineers, as well as correlation with human acceptance decisions. This will directly support the four observations. We believe this addresses the concern without altering the core findings. revision: yes

  2. Referee: [Benchmark curation] Benchmark construction and curation pipeline: no quantitative details are given on how data exclusions were decided, what inter-rater agreement was achieved during human annotation of Easy/Medium/Hard levels, or how many instances were discarded. These omissions affect the reliability of the 15/16/11 category counts and the difficulty stratification used to support the difficulty-profile findings.

    Authors: We acknowledge the need for greater transparency in the curation process. The human-in-the-loop pipeline involved multiple annotators, but specific quantitative details such as inter-annotator agreement and discard rates were not reported. In the revised manuscript, we will expand the benchmark construction section to include: (1) the total number of candidate instances collected, (2) criteria and numbers for exclusions at each stage, (3) inter-rater agreement statistics (e.g., Fleiss' kappa) for the Easy/Medium/Hard annotations across the 15/16/11 categories, and (4) final counts after curation. These details are available from our annotation logs and will be added to ensure the reliability of the difficulty profiles. revision: yes
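
The first response proposes Cohen's kappa between the automated judges and human engineers. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch for two raters. The accept/reject labels are invented for the example and do not come from the paper or its annotation logs.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: automated judge vs. one human engineer on eight outputs.
judge = ["accept", "accept", "reject", "accept", "reject", "reject", "accept", "reject"]
human = ["accept", "accept", "reject", "reject", "reject", "reject", "accept", "accept"]
print(cohens_kappa(judge, human))  # 0.5
```

Here the raters agree on 6 of 8 items (0.75), while chance agreement is 0.5, giving kappa = 0.5. Fleiss' kappa, mentioned in the second response, generalizes the same idea to more than two annotators.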

Circularity Check

0 steps flagged

No circularity: benchmark introduction and empirical evaluation are self-contained

full rationale

The paper presents WebCompass as a new multimodal benchmark spanning generation/editing/repair tasks across modalities, with human-in-the-loop curation and two proposed judge protocols (checklist-guided LLM-as-a-Judge and Agent-as-a-Judge using browser execution + MCP). No equations, fitted parameters, or first-principles derivations appear; the central claims are definitional descriptions of the benchmark construction and empirical observations on existing models. The 'mirrors professional workflows' framing is an interpretive assertion supported by the curation process rather than any reduction to self-citation or input-by-construction. Evaluation results (model rankings, difficulty profiles) are direct measurements on the introduced data, not predictions derived from within-paper fits. This is the standard non-circular pattern for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central contribution rests on assumptions about the reliability of LLM judges and browser-based agents to proxy human evaluation, plus the human curation pipeline; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Checklist-guided LLM-as-a-Judge reliably evaluates editing and repair quality
    Core to the evaluation protocol for two of the three task types.
  • domain assumption Agent-as-a-Judge with browser execution and MCP approximates human acceptance testing for generated websites
    Central to the novel generation evaluation method.

pith-pipeline@v0.9.0 · 5662 in / 1413 out tokens · 31851 ms · 2026-05-10T04:36:50.864712+00:00 · methodology
