PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
Web-bench: A LLM code benchmark based on web standards and frameworks.CoRR, abs/2505.07473, 2025
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.SE 3years
2026 3verdicts
UNVERDICTED 3representative citing papers
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
citing papers explorer
-
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
-
Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging
VF-Coder raises GUI code success rate from 21.68% to 28.29% and visual score from 0.4284 to 0.5584 on a new 984-task benchmark by adding direct visual perception and interaction.
-
RubberDuckBench: A Benchmark for AI Coding Assistants
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.