Koen Claessen and John Hughes

URLhttps://arxiv · 2025 · arXiv 2508.09101

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · 2 refs

citing papers explorer

Showing 4 of 4 citing papers.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization cs.SE · 2026-05-13 · unverdicted · none · ref 6
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD cs.AI · 2026-05-11 · unverdicted · none · ref 10 · 2 links
BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents cs.AI · 2026-04-19 · unverdicted · none · ref 7
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
PBT-Bench: Benchmarking AI Agents on Property-Based Testing cs.SE · 2026-05-13 · unreviewed · ref 3 · 2 links

Koen Claessen and John Hughes

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer