Koen Claessen and John Hughes

URLhttps://arxiv · 2025 · arXiv 2508.09101

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · unverdicted · novelty 7.0 · 3 refs

PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

cs.AI · 2026-05-11 · unverdicted · novelty 7.0 · 2 refs

BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.

SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

cs.SE · 2026-05-28 · unverdicted · novelty 6.0

Code-QA-Bench uses an answer-first pipeline and three-condition experiments to generate 628 tasks across 10 Python repositories and quantify that code access drives most performance gains while documentation adds only modest benefit on doc-dependent tasks.

citing papers explorer

Showing 5 of 5 citing papers.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing cs.SE · 2026-05-13 · unverdicted · none · ref 3 · 3 links
PBT-Bench is a new benchmark with 100 property-based testing problems across 40 Python libraries that measures LLM bug recall rates of 42.1-83.4% under guided prompting versus 31.4-76.7% in baseline.
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization cs.SE · 2026-05-13 · unverdicted · none · ref 6
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD cs.AI · 2026-05-11 · unverdicted · none · ref 10 · 2 links
BenchCAD benchmark shows frontier multimodal models recover coarse geometry but fail to produce accurate parametric CAD programs for industrial parts, with limited generalization after fine-tuning.
SkillFlow:Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents cs.AI · 2026-04-19 · unverdicted · none · ref 7
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA cs.SE · 2026-05-28 · unverdicted · none · ref 5
Code-QA-Bench uses an answer-first pipeline and three-condition experiments to generate 628 tasks across 10 Python repositories and quantify that code access drives most performance gains while documentation adds only modest benefit on doc-dependent tasks.

Koen Claessen and John Hughes

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer