{"total":13,"items":[{"citing_arxiv_id":"2606.05920","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement","primary_cat":"cs.SE","submitted_at":"2026-06-04T09:24:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Asuka-Bench is a new benchmark of 50 web tasks with 784 criteria that evaluates 8 LLMs in 2 frameworks on multi-round refinement, finding a 38-point spread in weighted task pass rate and a top score of only 52% after three rounds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00750","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"I-WebGenBench : Evaluating Interactivity in LLM-Generated Scientific Web Applications","primary_cat":"cs.CL","submitted_at":"2026-05-30T14:34:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A Paper-to-Interactive-System Agent and I-WebGenBench benchmark with 19 papers enable converting scientific PDFs into executable interactive web systems, with PaperVoyager framework shown to improve quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30000","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation","primary_cat":"cs.AI","submitted_at":"2026-05-28T14:30:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cookie-Bench is a reference-free 1,000-query web development benchmark paired with Cookie-Frame, a metacognition-inspired three-stage framework (static perception, agent 
interaction, dynamic scoring) that aligns with human ratings on 13 frontier LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26807","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML","primary_cat":"cs.SE","submitted_at":"2026-05-26T10:22:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17637","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games","primary_cat":"cs.AI","submitted_at":"2026-05-17T20:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% excellent rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17526","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS 
Engineering","primary_cat":"cs.SE","submitted_at":"2026-05-17T16:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17439","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents","primary_cat":"cs.SE","submitted_at":"2026-05-17T13:22:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% on RealDevBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17242","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements","primary_cat":"cs.SE","submitted_at":"2026-05-17T03:48:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user 
studies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07442","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GameGen-Verifier: Parallel Keypoint-Based Verification for LLM-Generated Games via Runtime State Injection","primary_cat":"cs.LG","submitted_at":"2026-05-08T08:46:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GameGen-Verifier decomposes game specifications into keypoints, injects runtime states for targeted checks, and achieves 92.2% accuracy on 100 games while running up to 16.6x faster than agent-based baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2023. [29] Yuxuan Lu, Bingsheng Yao, Hansu Gu, Jing Huang, Zheshen Jessie Wang, Yang Li, Jiri Gesi, Qi He, Toby Jia-Jun Li, and Dakuo Wang. Uxagent: An llm agent-based usability testing framework for web design. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA '25, pages 1-12. ACM, April 2025. [30] Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Webgen-bench: Evaluating llms on generating interactive and functional websites from scratch. arXiv preprint arXiv:2505.03733, 2025. [31] MiniMax. 
Minimax m2.1: Significantly enhanced multi-language programming, built for real-world complex tasks."},{"citing_arxiv_id":"2605.04637","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies","primary_cat":"cs.MA","submitted_at":"2026-05-06T08:30:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering quality, and security weaknesses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30358","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"QASM-Eval: A Dataset to Train and Evaluate LLMs on OpenQASM-3 Beyond Quantum Circuits","primary_cat":"cs.LG","submitted_at":"2026-04-28T19:01:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces QASM-Eval, the first dataset targeting OpenQASM-3 hardware-facing features for LLM training and evaluation, with an extended verifier for syntax, states, and timelines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20398","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-22T10:04:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebGen-R1 uses end-to-end RL with scaffold-driven generation and 
cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"enables the model to learn relative preferences among candidate outputs for the same prompt, even when absolute reward scales vary substantially across tasks. The KL term DKL constrains the policy from drifting excessively from the reference model, thereby preserving the syntactic and linguistic quality of the generated code. 4 Experiments 4.1 Experimental Setup Datasets and Benchmarks. We use WebGen-Instruct [28] as our training corpus. It comprises 6,667 end-to-end website generation tasks covering a broad range of real-world web application domains. For evaluation, we use the WebGen-Bench [28] benchmark, which contains 101 carefully curated website generation tasks ranging from minimalist portfolios to complex data-driven dashboards, including e-commerce frontends and real-time trackers."},{"citing_arxiv_id":"2604.15309","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation","primary_cat":"cs.CV","submitted_at":"2026-04-16T17:59:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MM-WebAgent is a hierarchical multimodal agent that coordinates AIGC tools through planning and iterative self-reflection to generate coherent, visually consistent webpages and outperforms baselines on a new benchmark.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"vergence or a maximum of 3 iterations. 
We compare MM-WebAgent with both code generation-based and agent-based baselines on MM-WebGEN-Bench. Code generation-based methods generate webpages in an end-to-end code generation paradigm, while agent-based baselines are implemented using bolt.diy [22] or Openhands [27]. The evaluated models include: OpenAI-GPT 4o [19], OpenAI-GPT 5mini [16], OpenAI-GPT 5 [15], OpenAI-GPT 5.1 [17], Qwen2.5-Coder-7B-Inst [9], Qwen2.5-Coder-32B-Inst [9], Qwen3-Coder-30B-A3B-Inst [25], and Qwen2.5-72B-Inst [24], and Gemini-2.5-Pro [3]. All evaluations are conducted three times and reported as the mean and standard deviation. 4.2 Main Results Paradigm Comparison on MM-WebGEN-Bench. We evaluate MM-WebAgent under different webpage generation paradigms on MM-WebGEN-"}],"limit":50,"offset":0}