Pith · machine review for the scientific record

arXiv:2603.11287 · v2 · submitted 2026-03-11 · 💻 cs.AR · cs.SE

Recognition: 2 Lean theorem links

Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes


Pith reviewed 2026-05-15 12:31 UTC · model grok-4.3

classification: 💻 cs.AR · cs.SE
keywords: RTL generation · LLM evaluation · Verilog synthesis · Hardware Quality Index · failure taxonomy · synthesis failures · post-synthesis quality

The pith

Evaluating 32 LLMs on 202 Verilog tasks with synthesis-in-the-loop reveals three performance regimes and distinct failure patterns by model type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM-generated RTL can be assessed for hardware quality beyond syntax and function by running designs through a full synthesis flow. It defines the Hardware Quality Index (HQI) to score outputs on area, delay, and warnings against expert references in a 45 nm process. Results divide the 32 models into three groups: 14 strong performers above 66 HQI, 15 in the middle range, and 3 weak ones below 43, with best-of-five sampling lifting scores by 3.7 to 22.1 points over single attempts. Failure analysis of 195 cases shows proprietary models hitting problems late, in elaboration or through timeouts, while open models fail early from structural omissions or non-synthesizable code. This matters for using LLMs in real hardware pipelines, where the generated code must produce efficient, working silicon.

Core claim

When 32 language models generate RTL for 202 Verilog tasks, three regimes emerge: 14 frontier models above 66 HQI, led by Gemini-3-Pro at 87.5 percent coverage and 85.1 HQI; 15 models clustered between 43 and 66; and 3 below 43. The gap between best-of-five and single-attempt quality reaches 3.7 to 22.1 HQI points. A taxonomy of 195 synthesis failures shows proprietary models failing late through elaboration errors and synthesis timeouts, while open models fail early from missing module wrappers and non-synthesizable constructs, a pattern tied to training data skewed toward simulation rather than synthesis-grade RTL.

What carries the argument

The Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warning counts relative to expert reference designs in a Nangate45 45 nm flow.
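
The paper describes HQI only qualitatively (a point the referee report below presses on), so any concrete formula is a guess. As a minimal sketch, assuming equal weights, ratio normalization against the expert reference, and a zero score on timeout, an HQI-style composite could look like this; every name and weight here is illustrative, not the authors' definition.

```python
# Hypothetical sketch of an HQI-style composite score. The paper does not
# disclose the exact formula; equal weights and ratio normalization against
# the expert reference design are assumptions made here for illustration.

def hqi_sketch(area, delay, warnings,
               ref_area, ref_delay, ref_warnings,
               weights=(1/3, 1/3, 1/3), timeout=False):
    """Return a 0-100 quality score relative to an expert reference.

    area/delay: post-synthesis results for the generated design.
    warnings: synthesis warning count for the generated design.
    ref_*: the same quantities for the expert reference design.
    """
    if timeout:  # assumed handling: a synthesis timeout scores zero
        return 0.0

    # Ratios capped at 1: matching or beating the reference earns full marks.
    area_score = min(ref_area / area, 1.0) if area > 0 else 0.0
    delay_score = min(ref_delay / delay, 1.0) if delay > 0 else 0.0
    # Extra warnings beyond the reference's count are penalized.
    warn_score = 1.0 / (1.0 + max(warnings - ref_warnings, 0))

    w_a, w_d, w_w = weights
    return 100.0 * (w_a * area_score + w_d * delay_score + w_w * warn_score)


# Example: a design 10% larger and 5% slower than the reference, one extra warning.
print(round(hqi_sketch(1100, 2.1, 3, 1000, 2.0, 2), 1))  # -> 78.7
```

Varying the weights or normalization in a sketch like this is also a cheap way to probe the referee's sensitivity concern: if small perturbations reorder models across the 43 and 66 boundaries, the regimes are metric artifacts rather than model properties.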

If this is right

  • Best-of-five sampling is required to reach usable RTL quality, limiting direct single-shot deployment in agentic design systems (an estimator for the best-of-k gap is sketched after this list).
  • Open models need training data that emphasizes synthesizable constructs and proper module structure to reduce early failures.
  • Proprietary models require improved robustness against late-stage elaboration and timeout errors.
  • Synthesis feedback must be incorporated into evaluation and generation loops for reliable hardware output.
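
To make the best-of-five versus single-attempt gap concrete: given n ≥ k sampled attempts per task, the expected best-of-k score has a closed-form unbiased estimator over order statistics, the same combinatorics used for pass@k but applied to a continuous score. The snippet below is illustrative tooling under that assumption, not code from the paper.

```python
# Unbiased estimate of expected best-of-k HQI from n >= k sampled attempts,
# using the pass@k combinatorics adapted to a continuous score.
from math import comb

def expected_best_of_k(scores, k):
    """Expected maximum score over a uniformly random k-subset of the attempts."""
    n = len(scores)
    if k > n:
        raise ValueError("need at least k sampled attempts")
    s = sorted(scores)  # ascending order statistics s[0] <= ... <= s[n-1]
    # The i-th order statistic (1-based) is the subset maximum with
    # probability C(i-1, k-1) / C(n, k).
    return sum(s[i - 1] * comb(i - 1, k - 1) for i in range(k, n + 1)) / comb(n, k)

attempts = [41.0, 55.5, 62.0, 70.2, 48.3]       # five attempts at one task (made-up scores)
single = sum(attempts) / len(attempts)          # expected single-attempt HQI: 55.4
best5 = expected_best_of_k(attempts, 5)         # best-of-five capability: 70.2
print(f"gap: {best5 - single:.1f} HQI points")  # the paper reports gaps of 3.7-22.1
```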

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • A closed-loop system that feeds synthesis warnings and area reports back to the model could iteratively raise HQI scores beyond current single-pass levels (a minimal loop skeleton follows this list).
  • The early-failure pattern in open models suggests that expanding training corpora with verified synthesis examples would narrow the gap to frontier performance.
  • Testing the same tasks on additional process nodes would clarify whether the observed quality regimes are technology-specific or broadly consistent.
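
As a sketch of the first extension above: a minimal repair loop that folds the flow's feedback into the next prompt. Here `generate` (an LLM call) and `synthesize` (a wrapper returning a report dict with `area`, `delay`, `warnings`, and `errors`) are hypothetical stand-ins; the paper itself evaluates only single-pass and best-of-five generation.

```python
# Minimal sketch of a synthesis-feedback repair loop. `generate` and
# `synthesize` are hypothetical callables standing in for a model API and a
# synthesis harness; nothing in this skeleton comes from the paper.

def repair_loop(task_prompt, generate, synthesize, score, max_rounds=3):
    """Iteratively regenerate RTL, feeding synthesis feedback into the prompt."""
    prompt = task_prompt
    best_rtl, best_hqi = None, float("-inf")
    for _ in range(max_rounds):
        rtl = generate(prompt)
        report = synthesize(rtl)  # dict: area, delay, warnings, errors
        hqi = score(report)
        if hqi > best_hqi:
            best_rtl, best_hqi = rtl, hqi
        if not report["warnings"] and not report["errors"]:
            break  # clean synthesis; stop early
        # Fold the flow's feedback into the next attempt.
        prompt = (task_prompt
                  + "\n\nPrevious attempt:\n" + rtl
                  + "\n\nSynthesis feedback:\n"
                  + "\n".join(report["errors"] + report["warnings"])
                  + f"\nArea: {report['area']}, delay: {report['delay']}."
                  + "\nFix the issues and reduce area and delay.")
    return best_rtl, best_hqi
```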

Load-bearing premise

The specific Nangate45 45 nm synthesis flow and the chosen combination of area, delay, and warnings into HQI give a representative measure of hardware quality that holds across different technologies and design styles.

What would settle it

Re-running the full set of 202 tasks through an alternate synthesis flow such as a 7nm node or a different commercial tool would show whether the three performance regimes and the early-versus-late failure split remain stable or shift.
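
A low-cost first step toward such a replication is scripting the tasks through an open flow. Yosys [14] is a standard open synthesis tool; the sketch below runs its generic `synth` pass on one generated design and parses the `stat` report. The file path is a placeholder, and a full replication would additionally re-map to a different cell library or a commercial tool, as the paragraph above suggests.

```python
# Sketch of a scripted re-run through an alternate open flow using Yosys [14].
# The file path is a placeholder; parsing assumes the standard `stat` report
# line "Number of cells: N".
import subprocess

def yosys_synth_check(verilog_path):
    """Synthesize one design with Yosys and return (ok, cell_count)."""
    result = subprocess.run(
        ["yosys", "-p", f"read_verilog {verilog_path}; synth; stat"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False, 0  # failed: elaboration error, bad construct, ...
    cells = 0
    for line in result.stdout.splitlines():
        line = line.strip()
        if line.startswith("Number of cells:"):
            cells = int(line.split()[-1])
    return True, cells

ok, cells = yosys_synth_check("generated/task_042.v")  # placeholder path
print("synthesizable" if ok else "failed", cells)
```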

Figures

Figures reproduced from arXiv:2603.11287 by Johann Knechtel, Minghao Shao, Muhammad Shafique, Ozgur Sinanoglu, Ramesh Karri, Weimin Fu, Xiaolong Guo, Zeng Wang.

Figure 1. Coverage–Global HQI capability landscape for 32 language models under synthesis-in-the-loop evaluation.
Figure 2. Best-of-five HQI heatmap across eight hardware categories and 32 models, ordered left-to-right by Global HQI.
Figure 3. Per-attempt Expected HQI heatmap across eight hardware categories and 32 models, using the same ordering as Figure 2.
Figure 4. Inference characteristics across all 32 models, measured via the OpenRouter API; cost and TTFT use log scale.
Original abstract

RTL generation is more than code synthesis. Designs must be syntactically valid, synthesizable, correct, and hardware-efficient. SOTA evaluations stop at functional correctness and do not measure synthesis and implementation quality. This paper evaluates 32 language models on 202 Verilog tasks from VerilogEval and RTLLM using the Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warnings relative to expert references in a Nangate45 45 nm flow. Three performance regimes emerge: 14 frontier models achieve HQI > 66, led by Gemini-3-Pro at 87.5% coverage and 85.1 HQI; 15 models cluster at 43–66 HQI; 3 are below 43. The gap between best-of-five capability and single-attempt quality spans 3.7–22.1 HQI points, limiting integration into agentic pipelines. A taxonomy of 195 synthesis failures reveals systematic divergence: proprietary models fail late through elaboration errors and synthesis timeouts; open models fail early, often due to missing module wrappers and non-synthesizable constructs, a pattern consistent with training corpora skewed toward simulation over synthesis-grade RTL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates 32 LLMs on 202 Verilog RTL generation tasks drawn from VerilogEval and RTLLM. It introduces the Hardware Quality Index (HQI) that aggregates post-synthesis area, delay, and warning metrics relative to expert references inside a single Nangate45 45 nm synthesis flow. The central claims are the emergence of three HQI regimes (14 frontier models >66, led by Gemini-3-Pro at 87.5% coverage and 85.1 HQI; 15 models in 43–66; 3 below 43), a 3.7–22.1 point gap between best-of-five and single-attempt quality, and a taxonomy of 195 synthesis failures showing proprietary models failing late (elaboration/timeout) versus open models failing early (missing wrappers, non-synthesizable constructs).

Significance. If the reported regimes and failure patterns prove robust, the work meaningfully advances RTL-generation evaluation by moving beyond functional correctness to post-synthesis quality. The scale (32 models, 202 tasks) and concrete taxonomy supply actionable data for training-data curation and prompting strategies. The synthesis-in-the-loop methodology itself is a clear methodological strength over simulation-only benchmarks.

major comments (2)
  1. §4 (Results): The three HQI regimes and the proprietary/open divergence in the 195-failure taxonomy rest entirely on the Nangate45 45 nm flow. No sensitivity analysis or ablation across alternative nodes, tools (OpenROAD vs. commercial), or optimization targets is presented; a change in area/delay penalties or elaboration triggers could reorder models between regimes or erase the reported model-type split.
  2. §3 (Methodology): The exact HQI formula (weights, normalization of area/delay/warnings to expert references, and handling of timeouts) is described only qualitatively. Without the explicit equation or pseudocode, independent reproduction of the 85.1 HQI for Gemini-3-Pro or verification of regime boundaries is not possible.

minor comments (2)
  1. Abstract: State the precise split of the 202 tasks between VerilogEval and RTLLM so readers can assess benchmark coverage.
  2. Results tables: Include per-model HQI, coverage, and best-of-five values for all 32 models in Table 2 or an equivalent results table to support the regime and gap claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and reproducibility of our evaluation. We address each major comment below and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: §4 (Results): The three HQI regimes and the proprietary/open divergence in the 195-failure taxonomy rest entirely on the Nangate45 45 nm flow. No sensitivity analysis or ablation across alternative nodes, tools (OpenROAD vs. commercial), or optimization targets is presented; a change in area/delay penalties or elaboration triggers could reorder models between regimes or erase the reported model-type split.

    Authors: We agree that the reported HQI regimes and the observed proprietary/open failure divergence are derived from a single synthesis flow. This design choice ensured computational tractability and strict consistency across 32 models and 202 tasks. We acknowledge the absence of a multi-node or multi-tool ablation as a limitation. In the revision we will add a dedicated paragraph in §4 discussing the potential sensitivity of the results to alternative flows, drawing on our internal checks with modified optimization scripts that preserved the top-tier regime boundaries for the leading models. The failure-mode taxonomy itself is largely independent of the synthesis tool, as it is driven by code-level issues (e.g., missing wrappers, non-synthesizable constructs) identified prior to synthesis; a lint sketch of such pre-synthesis checks follows these responses. revision: partial

  2. Referee: §3 (Methodology): The exact HQI formula (weights, normalization of area/delay/warnings to expert references, and handling of timeouts) is described only qualitatively. Without the explicit equation or pseudocode, independent reproduction of the 85.1 HQI for Gemini-3-Pro or verification of regime boundaries is not possible.

    Authors: The referee is correct that the HQI definition in the current manuscript is qualitative. We will insert the precise formula, including the weighting scheme, normalization procedure relative to expert references, and timeout handling, together with pseudocode, into the revised §3. This addition will enable exact reproduction of all reported HQI values and regime thresholds. revision: yes
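
To make "code-level issues identified prior to synthesis" concrete: the two early failure modes the taxonomy names, missing module wrappers and non-synthesizable constructs, can be flagged by a lint pass as simple as the sketch below. The regex patterns are illustrative guesses at those categories, not the paper's actual classifier.

```python
# Illustrative pre-synthesis lint for the two early failure modes the paper
# names: missing module wrappers and non-synthesizable constructs. The
# patterns are guesses at the categories, not the paper's classifier.
import re

NON_SYNTH = [
    (r"\binitial\b", "initial block"),          # simulation-only process
    (r"#\s*\d", "delay control (#)"),           # timing delays
    (r"\$\w+", "system task ($display, ...)"),  # simulation system tasks
]

def early_failure_lint(rtl: str) -> list[str]:
    """Return reasons this RTL would likely fail before or at elaboration."""
    issues = []
    if not re.search(r"\bmodule\b", rtl) or "endmodule" not in rtl:
        issues.append("missing module wrapper")
    for pattern, label in NON_SYNTH:
        if re.search(pattern, rtl):
            issues.append(f"non-synthesizable construct: {label}")
    return issues

print(early_failure_lint("assign y = a & b;"))  # -> ['missing module wrapper']
```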

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with external synthesis measurements

Full rationale

The paper reports direct measurements of 32 LLMs on 202 Verilog tasks using post-synthesis area, delay, and warning counts aggregated into HQI relative to expert references inside a fixed Nangate45 flow. No equations, predictions, or derivations are present that could reduce to their own inputs by construction. Performance regimes, best-of-five gaps, and the proprietary/open failure taxonomy are observational clusters and counts extracted from the synthesis runs; they are not forced by any self-definition, fitted parameter renamed as prediction, or self-citation chain. The study grounds itself in external benchmarks and tools, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper. No free parameters, axioms, or invented entities are described in the abstract; HQI is presented as a composite metric without explicit formula or fitting details.

pith-pipeline@v0.9.0 · 5536 in / 1203 out tokens · 55413 ms · 2026-05-15T12:31:14.148676+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation

    cs.AR 2026-04 unverdicted novelty 6.0

    Hyperparameter configuration in open-source LLMs for RTL generation produces up to 25.5% intra-model pass-rate variation on VerilogEval and RTLLM, exceeding inter-model spreads by 5x with near-zero correlation in opti...

  2. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 unverdicted novelty 3.0

    A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.

  3. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 accept novelty 2.0

    LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1] Reem Aleithan. 2025. Revisiting SWE-Bench: On the Importance of Data Quality for LLM-Based Code Models. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025 – Companion Proceedings, Ottawa, ON, Canada, April 27 – May 3, 2025. IEEE, 235–236. doi:10.1109/ICSE-COMPANION66252.2025.00075

  2. [2] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRR abs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

  3. [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4] Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pu...

  5. [5] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December...

  6. [6] Mingjie Liu, Nathaniel Ross Pinckney, Brucek Khailany, and Haoxing Ren. 2023. Invited Paper: VerilogEval: Evaluating Large Language Models for Verilog Code Generation. In IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, October 28 – November 2, 2023. IEEE, 1–8. doi:10.1109/ICCAD57390.2023.10323812

  7. [7] Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2025. RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 44, 4 (2025), 1448–1461. doi:10.1109/TCAD.2024.3483089

  8. [8] Shang Liu, Yao Lu, Wenji Fang, Mengming Li, and Zhiyao Xie. 2024. OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2024, Newark, NJ, USA, October 27–31, 2024. ACM, 60:1–60:9. doi:10.1145/3676536.3697118

  9. [9] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. 2024. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. In Proceedings of the 29th Asia and South Pacific Design Automation Conference, ASP-DAC 2024, Incheon, Korea, January 22–25, 2024. IEEE, 722–727. doi:10.1109/ASP-DAC58780.2024.10473904

  10. [10] Nangate Inc. 2008. The NanGate 45nm Open Cell Library. https://si2.org/

  11. [11] Nathaniel Ross Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. 2025. Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation. ACM Trans. Design Autom. Electr. Syst. 30, 6 (2025), 91:1–91:20. doi:10.1145/3718088

  12. [12] Mohammed Latif Siddiq, Simantika Dristi, Joy Saha, and Joanna C. S. Santos. 2024. The Fault in our Stars: Quality Assessment of Code Generation Benchmarks. In IEEE International Conference on Source Code Analysis and Manipulation, SCAM 2024, Flagstaff, AZ, USA, October 7–8, 2024. IEEE, 201–212. doi:10.1109/SCAM63643.2024.00028

  13. [13] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2024. VeriGen: A Large Language Model for Verilog Code Generation. ACM Trans. Design Autom. Electr. Syst. 29, 3 (2024), 46:1–46:31. doi:10.1145/3643681

  14. [14] Clifford Wolf. 2013. Yosys Open SYnthesis Suite. https://yosyshq.net/yosys/