WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

· 2026 · cs.SE · arXiv 2604.18224

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.

representative citing papers

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.

I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications

cs.CL · 2026-05-30 · unverdicted · novelty 5.0

A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.

citing papers explorer

Showing 2 of 2 citing papers.

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation cs.AI · 2026-05-28 · unverdicted · none · ref 17 · internal anchor
Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.
I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications cs.CL · 2026-05-30 · unverdicted · none · ref 22 · internal anchor
A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

fields

years

verdicts

representative citing papers

citing papers explorer