Koen Claessen and John Hughes
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
years: 2026 (4)
citing papers explorer
- PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
  PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
- BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
  BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
- SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
  SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
- PBT-Bench: Benchmarking AI Agents on Property-Based Testing