Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, Haoxing Ren · 2025 · arXiv 2506.14074

16 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

cs.AI · 2026-06-08 · unverdicted · novelty 8.0

RTL-BenchLS supplies a large-scale formally verified benchmark and three novel tasks that expose low performance of frontier LLMs on realistic RTL reasoning and generation.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications

cs.AR · 2026-05-26 · unverdicted · novelty 7.0

AssertLLM2 introduces a benchmark of 83 designs supporting bug-prevention and bug-hunting assertion generation tasks with evaluation across syntactic, formal, coverage, and mutation-based metrics.

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

cs.MA · 2026-05-13 · unverdicted · novelty 7.0

A self-trained multi-agent RL framework pairs Verilog and Python agents for oracle-free mutual verification in RTL generation and reports 75.0% / 80.1% pass@1 on VerilogEval V2 using 4B / 9B models.

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

cs.AR · 2026-04-21 · unverdicted · novelty 7.0

ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.

Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.

CASS-RTL: Correctness-Aware Subspace Steering for RTL Generation with LLMs

cs.PL · 2026-06-04 · unverdicted · novelty 6.0

CASS-RTL identifies correctness-linked attention heads, builds a steering subspace from them, and applies a geometry-aware intervention that raises pass@1/5/10 accuracy 10-20% on VerilogEval and 5% on CVDP across multiple LLMs without retraining or extra labels.

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.

RuC: HDL-Agnostic Rule Completion Benchmark Generation

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-in-the-Middle prompting performed best.

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0 · 2 refs

Spec2Cov uses an LLM agent in a feedback loop with a hardware simulator to generate tests from specs, achieving 100% coverage on simple designs and up to 49% on complex ones across 26 benchmarks.

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

Dr. RTL's multi-agent framework with group-relative skill learning achieves 21% WNS and 17% TNS timing improvements plus 6% area reduction on 20 real-world RTL designs over commercial synthesis tools.

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without altering the base model.

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification

cs.AR · 2026-04-17 · unverdicted · novelty 5.0

Domain-specialized LLM agents for hardware verification close 95-99% coverage using 4-13x fewer tokens and 2-4x faster convergence than general-purpose agents by reallocating tokens toward coverage-directed reasoning.

Agentic Hardware Design as Repository-Level Code Evolution

cs.AR · 2026-06-26 · unverdicted · novelty 4.0

HORIZON applies repository-level self-evolution to hardware design artifacts and reports 100% completion on ChipBench, RTLLM, VerilogEval, and nine CVDP categories using a hands-free agent loop.

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

cs.SE · 2026-06-16 · unverdicted · novelty 4.0

The framework applies transformation rules stepwise, driven by an LLM, to generate verifiable RTL hardware designs from specifications.

