Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification

2025 · arXiv 2506.14074

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

cs.AR · 2026-04-21 · unverdicted · novelty 7.0

ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.

RuC: HDL-Agnostic Rule Completion Benchmark Generation

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities. It is demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core, where Fill-in-the-Middle prompting performed best.

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs

cs.AR · 2026-04-17 · unverdicted · novelty 6.0 · 2 refs

Spec2Cov uses an LLM agent in a feedback loop with a hardware simulator to generate tests from specs, achieving 100% coverage on simple designs and up to 49% on complex ones across 26 benchmarks.

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

Dr. RTL's multi-agent framework with group-relative skill learning improves timing by 21% (WNS) and 17% (TNS) and reduces area by 6% relative to commercial synthesis tools on 20 real-world RTL designs.

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

cs.CR · 2026-04-29 · unverdicted · novelty 5.0

SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without altering the base model.

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification

cs.AR · 2026-04-17 · unverdicted · novelty 5.0

Domain-specialized LLM agents for hardware verification close 95-99% coverage using 4-13x fewer tokens and 2-4x faster convergence than general-purpose agents by reallocating tokens toward coverage-directed reasoning.

citing papers explorer

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
cs.AI · 2026-04-16 · unverdicted · none · ref 21

ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
cs.AR · 2026-04-21 · unverdicted · none · ref 2

RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
cs.AI · 2026-05-15 · unverdicted · none · ref 15

RuC: HDL-Agnostic Rule Completion Benchmark Generation
cs.AR · 2026-04-30 · unverdicted · none · ref 9

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
cs.AR · 2026-04-17 · unverdicted · none · ref 10 · 2 links

Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
cs.AI · 2026-04-16 · unverdicted · none · ref 29

SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation
cs.CR · 2026-04-29 · unverdicted · none · ref 16

Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification
cs.AR · 2026-04-17 · unverdicted · none · ref 27
