Pith · machine review for the scientific record

arXiv:2603.11287 · v2 · submitted 2026-03-11 · 💻 cs.AR · cs.SE

Recognition: 2 Lean theorem links

Synthesis-in-the-Loop Evaluation of LLMs for RTL Generation: Quality, Reliability, and Failure Modes


Pith reviewed 2026-05-15 12:31 UTC · model grok-4.3

classification: 💻 cs.AR · cs.SE
keywords: RTL generation · LLM evaluation · Verilog synthesis · Hardware Quality Index · failure taxonomy · synthesis failures · post-synthesis quality

The pith

Evaluating 32 LLMs on 202 Verilog tasks with synthesis-in-the-loop reveals three performance regimes and distinct failure patterns by model type.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM-generated RTL can be assessed for hardware quality beyond syntax and function by running designs through a full synthesis flow. It defines the Hardware Quality Index (HQI) to score outputs on area, delay, and warnings against expert references in a 45 nm process. Results divide the 32 models into three groups: 14 strong performers above 66 HQI, 15 in the middle range, and 3 weak ones below 43, with best-of-five sampling lifting scores by 3.7 to 22.1 points over single attempts. Failure analysis of 195 cases shows proprietary models hitting problems late, in elaboration or through timeouts, while open models fail early from structural omissions or non-synthesizable code. This matters for using LLMs in real hardware pipelines, where the generated code must produce efficient, working silicon.

Core claim

When 32 language models generate RTL for 202 Verilog tasks, three regimes emerge: 14 frontier models above 66 HQI, led by Gemini-3-Pro at 87.5 percent coverage and 85.1 HQI; 15 models clustered between 43 and 66; and 3 below 43. The gap between best-of-five and single-attempt quality reaches 3.7 to 22.1 HQI points. A taxonomy of 195 synthesis failures shows proprietary models failing late through elaboration errors and synthesis timeouts, while open models fail early from missing module wrappers and non-synthesizable constructs, a pattern tied to training data skewed toward simulation rather than synthesis-grade RTL.

What carries the argument

The Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warning counts relative to expert reference designs in a Nangate45 45 nm flow.
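
The paper describes HQI only qualitatively (a point the referee report below presses on), so any concrete formula is a guess. As a minimal sketch, assuming equal weights, ratio normalization against the expert reference, and a zero score on timeout, an HQI-style composite could look like this; every name and weight here is illustrative, not the authors' definition.

```python
# Hypothetical sketch of an HQI-style composite score. The paper does not
# disclose the exact formula; equal weights and ratio normalization against
# the expert reference design are assumptions made here for illustration.

def hqi_sketch(area, delay, warnings,
               ref_area, ref_delay, ref_warnings,
               weights=(1/3, 1/3, 1/3), timeout=False):
    """Return a 0-100 quality score relative to an expert reference.

    area/delay: post-synthesis results for the generated design.
    warnings: synthesis warning count for the generated design.
    ref_*: the same quantities for the expert reference design.
    """
    if timeout:  # assumed handling: a synthesis timeout scores zero
        return 0.0

    # Ratios capped at 1: matching or beating the reference earns full marks.
    area_score = min(ref_area / area, 1.0) if area > 0 else 0.0
    delay_score = min(ref_delay / delay, 1.0) if delay > 0 else 0.0
    # Extra warnings beyond the reference's count are penalized.
    warn_score = 1.0 / (1.0 + max(warnings - ref_warnings, 0))

    w_a, w_d, w_w = weights
    return 100.0 * (w_a * area_score + w_d * delay_score + w_w * warn_score)


# Example: a design 10% larger and 5% slower than the reference, one extra warning.
print(round(hqi_sketch(1100, 2.1, 3, 1000, 2.0, 2), 1))  # -> 78.7
```

Varying the weights or normalization in a sketch like this is also a cheap way to probe the referee's sensitivity concern: if small perturbations reorder models across the 43 and 66 boundaries, the regimes are metric artifacts rather than model properties.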

If this is right

  • Best-of-five sampling is required to reach usable RTL quality, limiting direct single-shot deployment in agentic design systems (an estimator for the best-of-k gap is sketched after this list).
  • Open models need training data that emphasizes synthesizable constructs and proper module structure to reduce early failures.
  • Proprietary models require improved robustness against late-stage elaboration and timeout errors.
  • Synthesis feedback must be incorporated into evaluation and generation loops for reliable hardware output.
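
To make the best-of-five versus single-attempt gap concrete: given n ≥ k sampled attempts per task, the expected best-of-k score has a closed-form unbiased estimator over order statistics, the same combinatorics used for pass@k but applied to a continuous score. The snippet below is illustrative tooling under that assumption, not code from the paper.

```python
# Unbiased estimate of expected best-of-k HQI from n >= k sampled attempts,
# using the pass@k combinatorics adapted to a continuous score.
from math import comb

def expected_best_of_k(scores, k):
    """Expected maximum score over a uniformly random k-subset of the attempts."""
    n = len(scores)
    if k > n:
        raise ValueError("need at least k sampled attempts")
    s = sorted(scores)  # ascending order statistics s[0] <= ... <= s[n-1]
    # The i-th order statistic (1-based) is the subset maximum with
    # probability C(i-1, k-1) / C(n, k).
    return sum(s[i - 1] * comb(i - 1, k - 1) for i in range(k, n + 1)) / comb(n, k)

attempts = [41.0, 55.5, 62.0, 70.2, 48.3]       # five attempts at one task (made-up scores)
single = sum(attempts) / len(attempts)          # expected single-attempt HQI: 55.4
best5 = expected_best_of_k(attempts, 5)         # best-of-five capability: 70.2
print(f"gap: {best5 - single:.1f} HQI points")  # the paper reports gaps of 3.7-22.1
```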

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • A closed-loop system that feeds synthesis warnings and area reports back to the model could iteratively raise HQI scores beyond current single-pass levels (a minimal loop skeleton follows this list).
  • The early-failure pattern in open models suggests that expanding training corpora with verified synthesis examples would narrow the gap to frontier performance.
  • Testing the same tasks on additional process nodes would clarify whether the observed quality regimes are technology-specific or broadly consistent.
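
As a sketch of the first extension above: a minimal repair loop that folds the flow's feedback into the next prompt. Here `generate` (an LLM call) and `synthesize` (a wrapper returning a report dict with `area`, `delay`, `warnings`, and `errors`) are hypothetical stand-ins; the paper itself evaluates only single-pass and best-of-five generation.

```python
# Minimal sketch of a synthesis-feedback repair loop. `generate` and
# `synthesize` are hypothetical callables standing in for a model API and a
# synthesis harness; nothing in this skeleton comes from the paper.

def repair_loop(task_prompt, generate, synthesize, score, max_rounds=3):
    """Iteratively regenerate RTL, feeding synthesis feedback into the prompt."""
    prompt = task_prompt
    best_rtl, best_hqi = None, float("-inf")
    for _ in range(max_rounds):
        rtl = generate(prompt)
        report = synthesize(rtl)  # dict: area, delay, warnings, errors
        hqi = score(report)
        if hqi > best_hqi:
            best_rtl, best_hqi = rtl, hqi
        if not report["warnings"] and not report["errors"]:
            break  # clean synthesis; stop early
        # Fold the flow's feedback into the next attempt.
        prompt = (task_prompt
                  + "\n\nPrevious attempt:\n" + rtl
                  + "\n\nSynthesis feedback:\n"
                  + "\n".join(report["errors"] + report["warnings"])
                  + f"\nArea: {report['area']}, delay: {report['delay']}."
                  + "\nFix the issues and reduce area and delay.")
    return best_rtl, best_hqi
```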

Load-bearing premise

The specific Nangate45 45 nm synthesis flow and the chosen combination of area, delay, and warnings into HQI give a representative measure of hardware quality that holds across different technologies and design styles.

What would settle it

Re-running the full set of 202 tasks through an alternate synthesis flow such as a 7nm node or a different commercial tool would show whether the three performance regimes and the early-versus-late failure split remain stable or shift.
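
A low-cost first step toward such a replication is scripting the tasks through an open flow. Yosys [14] is a standard open synthesis tool; the sketch below runs its generic `synth` pass on one generated design and parses the `stat` report. The file path is a placeholder, and a full replication would additionally re-map to a different cell library or a commercial tool, as the paragraph above suggests.

```python
# Sketch of a scripted re-run through an alternate open flow using Yosys [14].
# The file path is a placeholder; parsing assumes the standard `stat` report
# line "Number of cells: N".
import subprocess

def yosys_synth_check(verilog_path):
    """Synthesize one design with Yosys and return (ok, cell_count)."""
    result = subprocess.run(
        ["yosys", "-p", f"read_verilog {verilog_path}; synth; stat"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return False, 0  # failed: elaboration error, bad construct, ...
    cells = 0
    for line in result.stdout.splitlines():
        line = line.strip()
        if line.startswith("Number of cells:"):
            cells = int(line.split()[-1])
    return True, cells

ok, cells = yosys_synth_check("generated/task_042.v")  # placeholder path
print("synthesizable" if ok else "failed", cells)
```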

Figures

Figures reproduced from arXiv:2603.11287 by Johann Knechtel, Minghao Shao, Muhammad Shafique, Ozgur Sinanoglu, Ramesh Karri, Weimin Fu, Xiaolong Guo, Zeng Wang.

Figure 1. Coverage–Global HQI capability landscape for 32 language models under synthesis-in-the-loop evaluation.
Figure 2. Best-of-five HQI heatmap across eight hardware categories and 32 models, ordered left-to-right by Global HQI.
Figure 3. Per-attempt Expected HQI heatmap across eight hardware categories and 32 models, using the same ordering as Figure 2.
Figure 4. Inference characteristics across all 32 models, measured via the OpenRouter API; cost and TTFT use log scale.
Original abstract

RTL generation is more than code synthesis. Designs must be syntactically valid, synthesizable, correct, and hardware-efficient. SOTA evaluations stop at functional correctness and do not measure synthesis and implementation quality. This paper evaluates 32 language models on 202 Verilog tasks from VerilogEval and RTLLM using the Hardware Quality Index (HQI), which combines post-synthesis area, delay, and warnings relative to expert references in a Nangate45 45 nm flow. Three performance regimes emerge: 14 frontier models achieve HQI > 66, led by Gemini-3-Pro at 87.5% coverage and 85.1 HQI; 15 models cluster at 43–66 HQI; 3 are below 43. The gap between best-of-five capability and single-attempt quality spans 3.7–22.1 HQI points, limiting integration into agentic pipelines. A taxonomy of 195 synthesis failures reveals systematic divergence: proprietary models fail late through elaboration errors and synthesis timeouts; open models fail early, often due to missing module wrappers and non-synthesizable constructs, a pattern consistent with training corpora skewed toward simulation over synthesis-grade RTL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates 32 LLMs on 202 Verilog RTL generation tasks drawn from VerilogEval and RTLLM. It introduces the Hardware Quality Index (HQI) that aggregates post-synthesis area, delay, and warning metrics relative to expert references inside a single Nangate45 45 nm synthesis flow. The central claims are the emergence of three HQI regimes (14 frontier models >66, led by Gemini-3-Pro at 87.5% coverage and 85.1 HQI; 15 models in 43–66; 3 below 43), a 3.7–22.1 point gap between best-of-five and single-attempt quality, and a taxonomy of 195 synthesis failures showing proprietary models failing late (elaboration/timeout) versus open models failing early (missing wrappers, non-synthesizable constructs).

Significance. If the reported regimes and failure patterns prove robust, the work meaningfully advances RTL-generation evaluation by moving beyond functional correctness to post-synthesis quality. The scale (32 models, 202 tasks) and concrete taxonomy supply actionable data for training-data curation and prompting strategies. The synthesis-in-the-loop methodology itself is a clear methodological strength over simulation-only benchmarks.

major comments (2)
  1. §4 (Results): The three HQI regimes and the proprietary/open divergence in the 195-failure taxonomy rest entirely on the Nangate45 45 nm flow. No sensitivity analysis or ablation across alternative nodes, tools (OpenROAD vs. commercial), or optimization targets is presented; a change in area/delay penalties or elaboration triggers could reorder models between regimes or erase the reported model-type split.
  2. §3 (Methodology): The exact HQI formula (weights, normalization of area/delay/warnings to expert references, and handling of timeouts) is described only qualitatively. Without the explicit equation or pseudocode, independent reproduction of the 85.1 HQI for Gemini-3-Pro or verification of regime boundaries is not possible.

minor comments (2)
  1. Abstract: State the precise split of the 202 tasks between VerilogEval and RTLLM so readers can assess benchmark coverage.
  2. Results tables: Include per-model HQI, coverage, and best-of-five values for all 32 models in Table 2 or an equivalent results table to support the regime and gap claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify the scope and reproducibility of our evaluation. We address each major comment below and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: §4 (Results): The three HQI regimes and the proprietary/open divergence in the 195-failure taxonomy rest entirely on the Nangate45 45 nm flow. No sensitivity analysis or ablation across alternative nodes, tools (OpenROAD vs. commercial), or optimization targets is presented; a change in area/delay penalties or elaboration triggers could reorder models between regimes or erase the reported model-type split.

    Authors: We agree that the reported HQI regimes and the observed proprietary/open failure divergence are derived from a single synthesis flow. This design choice ensured computational tractability and strict consistency across 32 models and 202 tasks. We acknowledge the absence of a multi-node or multi-tool ablation as a limitation. In the revision we will add a dedicated paragraph in §4 discussing the potential sensitivity of the results to alternative flows, drawing on our internal checks with modified optimization scripts that preserved the top-tier regime boundaries for the leading models. The failure-mode taxonomy itself is largely independent of the synthesis tool, as it is driven by code-level issues (e.g., missing wrappers, non-synthesizable constructs) identified prior to synthesis; a lint sketch of such pre-synthesis checks follows these responses. revision: partial

  2. Referee: §3 (Methodology): The exact HQI formula (weights, normalization of area/delay/warnings to expert references, and handling of timeouts) is described only qualitatively. Without the explicit equation or pseudocode, independent reproduction of the 85.1 HQI for Gemini-3-Pro or verification of regime boundaries is not possible.

    Authors: The referee is correct that the HQI definition in the current manuscript is qualitative. We will insert the precise formula, including the weighting scheme, normalization procedure relative to expert references, and timeout handling, together with pseudocode, into the revised §3. This addition will enable exact reproduction of all reported HQI values and regime thresholds. revision: yes
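
To make "code-level issues identified prior to synthesis" concrete: the two early failure modes the taxonomy names, missing module wrappers and non-synthesizable constructs, can be flagged by a lint pass as simple as the sketch below. The regex patterns are illustrative guesses at those categories, not the paper's actual classifier.

```python
# Illustrative pre-synthesis lint for the two early failure modes the paper
# names: missing module wrappers and non-synthesizable constructs. The
# patterns are guesses at the categories, not the paper's classifier.
import re

NON_SYNTH = [
    (r"\binitial\b", "initial block"),          # simulation-only process
    (r"#\s*\d", "delay control (#)"),           # timing delays
    (r"\$\w+", "system task ($display, ...)"),  # simulation system tasks
]

def early_failure_lint(rtl: str) -> list[str]:
    """Return reasons this RTL would likely fail before or at elaboration."""
    issues = []
    if not re.search(r"\bmodule\b", rtl) or "endmodule" not in rtl:
        issues.append("missing module wrapper")
    for pattern, label in NON_SYNTH:
        if re.search(pattern, rtl):
            issues.append(f"non-synthesizable construct: {label}")
    return issues

print(early_failure_lint("assign y = a & b;"))  # -> ['missing module wrapper']
```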

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with external synthesis measurements

Full rationale

The paper reports direct measurements of 32 LLMs on 202 Verilog tasks using post-synthesis area, delay, and warning counts aggregated into HQI relative to expert references inside a fixed Nangate45 flow. No equations, predictions, or derivations are present that could reduce to their own inputs by construction. Performance regimes, best-of-five gaps, and the proprietary/open failure taxonomy are observational clusters and counts extracted from the synthesis runs; they are not forced by any self-definition, fitted parameter renamed as prediction, or self-citation chain. The study grounds itself in external benchmarks and tools, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical evaluation paper. No free parameters, axioms, or invented entities are described in the abstract; HQI is presented as a composite metric without explicit formula or fitting details.

pith-pipeline@v0.9.0 · 5536 in / 1203 out tokens · 55413 ms · 2026-05-15T12:31:14.148676+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation

    cs.AR 2026-04 unverdicted novelty 6.0

    Hyperparameter configuration in open-source LLMs for RTL generation produces up to 25.5% intra-model pass-rate variation on VerilogEval and RTLLM, exceeding inter-model spreads by 5x with near-zero correlation in opti...

  2. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 unverdicted novelty 3.0

    A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.

  3. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 accept novelty 2.0

    LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1] Reem Aleithan. 2025. Revisiting SWE-Bench: On the Importance of Data Quality for LLM-Based Code Models. In 47th IEEE/ACM International Conference on Software Engineering, ICSE 2025 – Companion Proceedings, Ottawa, ON, Canada, April 27 – May 3, 2025. IEEE, 235–236. doi:10.1109/ICSE-COMPANION66252.2025.00075

  2. [2] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models. CoRR abs/2108.07732 (2021). arXiv:2108.07732 https://arxiv.org/abs/2108.07732

  3. [3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  4. [4] Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pu...

  5. [5] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December...

  6. [6] Mingjie Liu, Nathaniel Ross Pinckney, Brucek Khailany, and Haoxing Ren. 2023. Invited Paper: VerilogEval: Evaluating Large Language Models for Verilog Code Generation. In IEEE/ACM International Conference on Computer Aided Design, ICCAD 2023, San Francisco, CA, USA, October 28 – November 2, 2023. IEEE, 1–8. doi:10.1109/ICCAD57390.2023.10323812

  7. [7] Shang Liu, Wenji Fang, Yao Lu, Jing Wang, Qijun Zhang, Hongce Zhang, and Zhiyao Xie. 2025. RTLCoder: Fully Open-Source and Efficient LLM-Assisted RTL Code Generation Technique. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 44, 4 (2025), 1448–1461. doi:10.1109/TCAD.2024.3483089

  8. [8] Shang Liu, Yao Lu, Wenji Fang, Mengming Li, and Zhiyao Xie. 2024. OpenLLM-RTL: Open Dataset and Benchmark for LLM-Aided Design RTL Generation. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, ICCAD 2024, Newark, NJ, USA, October 27–31, 2024. ACM, 60:1–60:9. doi:10.1145/3676536.3697118

  9. [9] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. 2024. RTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model. In Proceedings of the 29th Asia and South Pacific Design Automation Conference, ASP-DAC 2024, Incheon, Korea, January 22–25, 2024. IEEE, 722–727. doi:10.1109/ASP-DAC58780.2024.10473904

  10. [10] Nangate Inc. 2008. The NanGate 45nm Open Cell Library. https://si2.org/

  11. [11] Nathaniel Ross Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. 2025. Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation. ACM Trans. Design Autom. Electr. Syst. 30, 6 (2025), 91:1–91:20. doi:10.1145/3718088

  12. [12] Mohammed Latif Siddiq, Simantika Dristi, Joy Saha, and Joanna C. S. Santos. 2024. The Fault in our Stars: Quality Assessment of Code Generation Benchmarks. In IEEE International Conference on Source Code Analysis and Manipulation, SCAM 2024, Flagstaff, AZ, USA, October 7–8, 2024. IEEE, 201–212. doi:10.1109/SCAM63643.2024.00028

  13. [13] Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2024. VeriGen: A Large Language Model for Verilog Code Generation. ACM Trans. Design Autom. Electr. Syst. 29, 3 (2024), 46:1–46:31. doi:10.1145/3643681

  14. [14] Clifford Wolf. 2013. Yosys Open SYnthesis Suite. https://yosyshq.net/yosys/