Web-bench: A LLM code benchmark based on web standards and frameworks. CoRR, abs/2505.07473, 2025

Kai Xu, YiWei Mao, XinYi Guan, ZiLong Feng · 2025 · arXiv 2505.07473

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

cs.SE · 2026-03-14 · unverdicted · novelty 7.0

VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.

RubberDuckBench: A Benchmark for AI Coding Assistants

cs.SE · 2026-01-23 · unverdicted · novelty 7.0

RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.

citing papers explorer

Showing 3 of 3 citing papers.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization · cs.SE · 2026-05-13 · unverdicted · none · ref 45
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging · cs.SE · 2026-03-14 · unverdicted · none · ref 34
RubberDuckBench: A Benchmark for AI Coding Assistants · cs.SE · 2026-01-23 · unverdicted · none · ref 38
