Beyond Synthetic Benchmarks. arXiv:2510.26130, October
3 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
Roles: background (1)
Polarities: background (1)
Citing papers
- ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
  ClassEval-Pro benchmark shows frontier LLMs achieve at most 45.6% Pass@1 on class-level code tasks, with logic errors (56%) and dependency errors (38%) as dominant failure modes.
- VeriGraphi: A Multi-Agent Framework of Hierarchical RTL Generation for Large Hardware Designs
  VeriGraphi introduces a knowledge-graph-anchored multi-agent pipeline that produces reliable hierarchical synthesizable Verilog for complex designs such as RISC-V processors.
- Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
  Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.