Comprehensive Verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on RTL design and verification
8 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (8)
Verdicts: UNVERDICTED (8)
citing papers explorer
- HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
  HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
- ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration
  ChipCraftBrain achieves 97.2% pass rate on VerilogEval and 94.7% on CVDP benchmarks for generating functional RTL code using adaptive multi-agent orchestration and hybrid reasoning.
- RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
  RTL-BenchMT is an agent-assisted framework for dynamically maintaining RTL generation benchmarks by fixing flaws and reducing overfitting in LLM-based EDA applications.
- RuC: HDL-Agnostic Rule Completion Benchmark Generation
  RuC generates language-agnostic, grammar-based benchmarks for evaluating LLMs on RTL code completion at controllable granularities, demonstrated on SystemVerilog designs from Tiny Tapeout and a RISC-V core where Fill-in-the-Middle prompting performed best.
- Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
  Spec2Cov uses an LLM agent in a feedback loop with a hardware simulator to generate tests from specs, achieving 100% coverage on simple designs and up to 49% on complex ones across 26 benchmarks.
- Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
  Dr. RTL's multi-agent framework with group-relative skill learning achieves 21% WNS and 17% TNS timing improvements plus 6% area reduction on 20 real-world RTL designs over commercial synthesis tools.
- SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation
  SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without altering the base model.
- Understanding Inference-Time Token Allocation and Coverage Limits in Agentic Hardware Verification
  Domain-specialized LLM agents for hardware verification close 95-99% coverage using 4-13x fewer tokens and 2-4x faster convergence than general-purpose agents by reallocating tokens toward coverage-directed reasoning.