Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, Haoxing Ren · 2025 · arXiv 2506.14074

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

cs.MA · 2026-05-13 · unverdicted · novelty 7.0

A self-trained multi-agent RL framework pairs Verilog and Python agents for oracle-free mutual verification in RTL generation and reports 75.0% / 80.1% pass@1 on VerilogEval V2 using 4B / 9B models.

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

cs.AR · 2026-04-21 · unverdicted · novelty 7.0

ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.

RuC: HDL-Agnostic Rule Completion Benchmark Generation

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-in-the-Middle prompting performed best.

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0 · 2 refs

Spec2Cov uses an LLM agent in a feedback loop with a hardware simulator to generate tests from specs, achieving 100% coverage on simple designs and up to 49% on complex ones across 26 benchmarks.

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

Dr. RTL's multi-agent framework with group-relative skill learning achieves 21% WNS and 17% TNS timing improvements plus 6% area reduction on 20 real-world RTL designs over commercial synthesis tools.

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without altering the base model.

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification

cs.AR · 2026-04-17 · unverdicted · novelty 5.0

Domain-specialized LLM agents for hardware verification close 95-99% coverage using 4-13x fewer tokens and 2-4x faster convergence than general-purpose agents by reallocating tokens toward coverage-directed reasoning.

citing papers explorer

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
cs.AI · 2026-04-16 · unverdicted · none · ref 21

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research
cs.AR · 2026-06-25 · unverdicted · none · ref 84

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation
cs.MA · 2026-05-13 · unverdicted · none · ref 20

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
cs.AR · 2026-04-21 · unverdicted · none · ref 2

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
cs.AI · 2026-05-15 · unverdicted · none · ref 15

RuC: HDL-Agnostic Rule Completion Benchmark Generation
cs.AR · 2026-04-30 · unverdicted · none · ref 9

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
cs.AR · 2026-04-17 · unverdicted · none · ref 10 · 2 links

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
cs.AI · 2026-04-16 · unverdicted · none · ref 29

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation
cs.CR · 2026-04-29 · unverdicted · none · ref 16

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification
cs.AR · 2026-04-17 · unverdicted · none · ref 27
