Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification
16 Pith papers cite this work. Polarity classification is still indexing.
years
2026: 16
verdicts
UNVERDICTED: 16
representative citing papers
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.
AssertLLM2 introduces a benchmark of 83 designs supporting bug-prevention and bug-hunting assertion generation tasks with evaluation across syntactic, formal, coverage, and mutation-based metrics.
A self-trained multi-agent RL framework pairs Verilog and Python agents for oracle-free mutual verification in RTL generation and reports 75.0% and 80.1% pass@1 on VerilogEval V2 using 4B and 9B models, respectively.
ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.
STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.
CASS-RTL identifies correctness-linked attention heads, builds a steering subspace from them, and applies a geometry-aware intervention that raises pass@1/5/10 accuracy by 10-20% on VerilogEval and by 5% on CVDP across multiple LLMs without retraining or extra labels.
RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.
RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-in-the-Middle prompting performed best.
Spec2Cov uses an LLM agent in a feedback loop with a hardware simulator to generate tests from specs, achieving 100% coverage on simple designs and up to 49% on complex ones across 26 benchmarks.
Dr. RTL's multi-agent framework with group-relative skill learning achieves 21% WNS and 17% TNS timing improvements plus 6% area reduction on 20 real-world RTL designs over commercial synthesis tools.
SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without altering the base model.
Domain-specialized LLM agents for hardware verification close 95-99% coverage using 4-13x fewer tokens and 2-4x faster convergence than general-purpose agents by reallocating tokens toward coverage-directed reasoning.
HORIZON applies repository-level self-evolution to hardware design artifacts and reports 100% completion on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories using a hands-free agent loop.
A framework that uses LLM-driven stepwise application of transformation rules to generate verifiable RTL hardware designs from specifications.
citing papers explorer
- RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models
RTL-BenchLS supplies a large-scale formally verified benchmark and three novel tasks that expose low performance of frontier LLMs on realistic RTL reasoning and generation.
- HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
- Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.
- RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.
- Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
Dr. RTL's multi-agent framework with group-relative skill learning achieves 21% WNS and 17% TNS timing improvements plus 6% area reduction on 20 real-world RTL designs over commercial synthesis tools.