pith. machine review for the scientific record.

arxiv: 2604.17102 · v1 · submitted 2026-04-18 · 💻 cs.AR · cs.AI

Recognition: unknown

Configuration Over Selection: Hyperparameter Sensitivity Exceeds Model Differences in Open-Source LLMs for RTL Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:55 UTC · model grok-4.3

classification 💻 cs.AR cs.AI
keywords: LLM benchmarking · RTL generation · Verilog · hyperparameter sensitivity · hardware design automation · decoding parameters · open-source models

The pith

For open-source LLMs generating RTL hardware code, hyperparameter settings create larger performance gaps than the choice of model itself.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that inference-time configuration choices affect success rates in producing correct RTL designs more than differences among model families. After mapping 26 open-source LLMs on two synthesis-checked benchmarks, the authors run a 108-setting sweep on three models and find that the largest pass-rate difference inside one model reaches 25.5 percent. This spread is five times the average gap measured across models when each uses its default configuration. Configurations that perform best on one benchmark show almost no rank correlation with performance on the second benchmark. The results indicate that fixed-default evaluations mix model quality with arbitrary tuning effects.

Core claim

Benchmarking 26 open-source LLMs on VerilogEval and RTLLM with synthesis-in-the-loop evaluation first establishes the current capability landscape. An exhaustive 108-configuration hyperparameter sweep on three prominent models then reveals absolute pass-rate gaps of up to 25.5 percent between the best and worst settings for the same LLM. This intra-model variation is five times larger than the average spread observed across model families under their respective default configurations. Rank-correlating the 108 configurations across the two benchmark suites (Spearman's rho) yields near-zero correlation, showing that optimal configurations do not transfer between suites. These findings demonstrate that LLM benchmarking conducted under default hyperparameters confounds model capability with configuration effects, and that realizing the full potential of open-source LLMs for RTL generation requires architecture- and benchmark-aware hyperparameter selection.
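
A minimal sketch of the transfer test described above, assuming one pass-rate value per configuration on each suite; the arrays, seed, and ranges below are hypothetical placeholders, not the paper's data.

```python
# Sketch of the configuration-transfer test: rank all decoding configurations
# by pass rate on each benchmark and check how well the rankings agree.
# The pass-rate arrays here are hypothetical placeholders, not the paper's data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_configs = 108  # one entry per decoding configuration in the sweep

# Hypothetical pass rates for the same 108 configurations on the two suites.
pass_verilogeval = rng.uniform(0.30, 0.60, size=n_configs)
pass_rtllm = rng.uniform(0.25, 0.55, size=n_configs)

rho, p_value = spearmanr(pass_verilogeval, pass_rtllm)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")
# A rho near zero means the configuration ranking on one suite carries
# essentially no information about the ranking on the other.
```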

What carries the argument

The 108-configuration hyperparameter sweep (temperature, top-p, and related decoding parameters), evaluated by pass rate on synthesis-verified RTL benchmarks; the sweep directly measures intra-model sensitivity and compares it against inter-model gaps under default settings.
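
The exact grid is not reproduced on this page; the sketch below only illustrates the shape of such a sweep, assuming a hypothetical temperature × top-p × top-k × repetition-penalty grid that happens to yield 108 combinations and a placeholder evaluator standing in for the generation-plus-synthesis harness.

```python
# Sketch of an exhaustive decoding-parameter sweep. The 4 x 3 x 3 x 3 = 108
# grid below is hypothetical, not the paper's grid, and evaluate_pass_rate is
# a stand-in for the real generation + synthesis-in-the-loop flow.
from itertools import product
import random

TEMPERATURE = [0.2, 0.5, 0.8, 1.0]
TOP_P = [0.8, 0.9, 0.95]
TOP_K = [20, 50, 100]
REPETITION_PENALTY = [1.0, 1.05, 1.1]

def evaluate_pass_rate(model: str, benchmark: str, config: dict) -> float:
    """Placeholder: generate RTL for every benchmark task with `config`, run
    the synthesis checks, and return the fraction of tasks that pass.
    Here it returns a random value so the sketch runs end to end."""
    return random.random()

def sweep(model: str, benchmark: str) -> dict:
    results = {}
    for t, p, k, r in product(TEMPERATURE, TOP_P, TOP_K, REPETITION_PENALTY):
        config = {"temperature": t, "top_p": p, "top_k": k, "repetition_penalty": r}
        results[(t, p, k, r)] = evaluate_pass_rate(model, benchmark, config)
    return results  # 108 entries per (model, benchmark) pair

scores = sweep("some-open-model", "VerilogEval")
intra_model_gap = max(scores.values()) - min(scores.values())
print(f"best-minus-worst pass-rate gap for this model: {intra_model_gap:.3f}")
```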

If this is right

  • Evaluations that fix hyperparameters to defaults systematically confound model capability with arbitrary configuration effects.
  • Model rankings can reverse once each model receives its own optimized decoding settings (see the sketch after this list).
  • Configurations that maximize performance on one RTL benchmark do not reliably maximize performance on another.
  • Future benchmarking protocols for hardware code generation must incorporate per-model and per-benchmark hyperparameter search.
  • Open-source LLM utility for RTL tasks depends more on tuning methodology than on the release of new base models.
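
A toy illustration of the ranking-reversal point above; the model labels and pass rates are invented solely to show how a default-setting ranking and a tuned-setting ranking would be compared.

```python
# Toy comparison of model rankings under default settings versus each model's
# best-found configuration. All numbers are invented for illustration only.
default_pass = {"model_a": 0.48, "model_b": 0.44, "model_c": 0.41}
best_tuned_pass = {"model_a": 0.52, "model_b": 0.58, "model_c": 0.55}

rank_default = sorted(default_pass, key=default_pass.get, reverse=True)
rank_tuned = sorted(best_tuned_pass, key=best_tuned_pass.get, reverse=True)

print("ranking at defaults :", rank_default)  # ['model_a', 'model_b', 'model_c']
print("ranking when tuned  :", rank_tuned)    # ['model_b', 'model_c', 'model_a']
print("order preserved?    :", rank_default == rank_tuned)  # False -> reversal
```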

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Engineering time spent on decoding-parameter search and prompt refinement may yield larger gains than switching to newer models.
  • The same sensitivity pattern could appear in other code-generation tasks outside hardware design.
  • Benchmark maintainers could reduce confounding by publishing recommended hyperparameter ranges alongside each suite.
  • If closed-source models expose comparable decoding controls, the same sweep methodology could test whether they exhibit lower configuration sensitivity.

Load-bearing premise

The 108-configuration sweep performed on only three models is representative of sensitivity across all 26 evaluated models.

What would settle it

Repeating the full hyperparameter sweep on several additional models from the remaining twenty-three and finding that their average intra-model pass-rate gaps fall below the observed inter-model spread under defaults would falsify the central claim.
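
A sketch of that check, assuming the sweep has been repeated on additional models drawn from the remaining twenty-three; every number below is a placeholder, not a measured result.

```python
# Sketch of the falsification test: for each newly swept model, take its
# intra-model best-minus-worst pass-rate gap and compare the average against
# the inter-model spread observed at default settings.
import statistics

# Hypothetical intra-model gaps for additional swept models.
intra_model_gaps = {"model_d": 0.07, "model_e": 0.04, "model_f": 0.06}

# Hypothetical average pass-rate spread across model families at defaults.
inter_model_spread_default = 0.05

mean_gap = statistics.mean(intra_model_gaps.values())
print(f"mean intra-model gap           = {mean_gap:.3f}")
print(f"inter-model spread at defaults = {inter_model_spread_default:.3f}")
print("central claim falsified on this sample:", mean_gap < inter_model_spread_default)
```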

Figures

Figures reproduced from arXiv: 2604.17102 by Johann Knechtel, Minghao Shao, Muhammad Shafique, Ozgur Sinanoglu, Ramesh Karri, Weimin Fu, Xiaolong Guo, Zeng Wang.

Figure 1. Pass rates of 26 open-source LLMs on VerilogEval and …
Figure 2. Pass-rate and HQI landscape for 26 language models. Top: pass@1 (solid) and pass@5 (faded), sorted by descending …
Figure 3. HQI across eight hardware categories and 26 models. Models are ordered left-to-right by Global HQI; categories top …
Figure 4. Inference performance metrics. DeepSeek V3.2 and GPT-OSS 120B achieve higher Global HQI relative to their coverage, indicating strong post-synthesis quality on the tasks they do solve. Generational progression does not uniformly improve RTL capability either; within the GLM family, newer releases trade coverage breadth for design depth, a distinction that pass-rate metrics alone do not capture.
Figure 5. Scatter plots of Pass Rate vs. HQI across …
Figure 6. Hyperparameter–performance correlation across the three models.
Figure 7. Distribution of pass rates under 108 hyperparameter configurations.
read the original abstract

Benchmarking of open-source LLMs for hardware design focuses on which LLMs to use, while treating inference-time decoding configuration as a secondary concern. This work shows that it matters more how an LLM is configured than which model is selected. Benchmarking 26 open-source LLMs on VerilogEval and RTLLM with synthesis-in-the-loop evaluation, the study first maps the current capability landscape and then conducts an extensive 108-configuration hyperparameter sweep on three prominent models. The sweep reveals absolute pass-rate gaps of up to 25.5% between the best and worst settings for the same LLM, which is 5x larger than the average spread observed across various model families under their respective default configurations. Ranking all configurations by Spearman's $\rho$ across the two benchmark suites yields near-zero correlation, demonstrating that optimal configurations do not transfer. These results show that benchmarking conducted under default hyperparameters confounds model capabilities with configuration effects. Realizing the full potential of open-source LLMs for RTL generation requires architecture and benchmark aware hyperparameter selection, as enabled by the proposed methodology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper benchmarks 26 open-source LLMs for RTL generation on VerilogEval and RTLLM with synthesis-in-the-loop evaluation. It first maps the capability landscape under default hyperparameters and then performs a 108-configuration hyperparameter sweep on three prominent models. The central claims are that absolute pass-rate gaps reach 25.5% within a single LLM (5x the average inter-model spread under defaults) and that optimal configurations show near-zero Spearman's ρ correlation across the two benchmarks, implying that configuration effects dominate model selection and that default-based benchmarking confounds results.

Significance. If the quantitative comparisons hold after addressing scope limitations, the work provides a useful empirical demonstration that inference-time decoding choices can outweigh model-family differences in LLM-based hardware design tasks. The synthesis-in-the-loop evaluation on two benchmarks and the use of rank correlation to test transferability are concrete strengths that could inform more careful benchmarking practices in the field.

major comments (3)
  1. [Abstract and hyperparameter sweep results] Abstract and the results section presenting the 108-configuration sweep: the claim that configuration-induced gaps (up to 25.5%) are 5x larger than model-family spread is derived exclusively from sensitivity measured on three models, while the inter-model spread uses only default configurations across all 26. Without either extending the sweep to additional models or providing evidence that the three chosen models are representative of sensitivity levels, the generalization that 'it matters more how an LLM is configured than which model is selected' does not follow for the full set.
  2. [Abstract and correlation analysis] Abstract and correlation analysis section: the near-zero Spearman's ρ demonstrating non-transfer of optimal configurations is computed only over the 108 configurations of the three swept models. This leaves open whether the lack of transfer holds when considering the broader 26-model landscape or when comparing across different model families.
  3. [Methods and results on sweep] Methods and results sections describing the sweep: the manuscript reports specific quantitative outcomes (25.5% gaps, 5x factor) but provides no visible details on exact hyperparameter ranges explored, statistical tests used to establish significance of differences, or error bars on pass rates. These omissions make it impossible to verify that the best/worst settings were not post-hoc selections and that the reported gaps are robust.
minor comments (1)
  1. [Capability landscape section] The description of the 26-model landscape could more explicitly state the default hyperparameter values used for each model family to allow direct replication of the baseline spread.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects of generalizability and methodological transparency. We address each major comment point by point below and indicate the revisions planned for the next manuscript version.

read point-by-point responses
  1. Referee: [Abstract and hyperparameter sweep results] Abstract and the results section presenting the 108-configuration sweep: the claim that configuration-induced gaps (up to 25.5%) are 5x larger than model-family spread is derived exclusively from sensitivity measured on three models, while the inter-model spread uses only default configurations across all 26. Without either extending the sweep to additional models or providing evidence that the three chosen models are representative of sensitivity levels, the generalization that 'it matters more how an LLM is configured than which model is selected' does not follow for the full set.

    Authors: We acknowledge the distinction between the intra-model sensitivity analysis (three models) and the inter-model comparison (defaults across 26). The three models were deliberately selected as prominent, high-performing representatives from the initial 26-model benchmark, spanning different families, sizes, and training approaches (CodeLlama, DeepSeek-Coder, and WizardCoder variants). In the revision we will add explicit justification for this selection, including their relative rankings and diversity metrics from the default-configuration results. We will also qualify the abstract and results statements to emphasize that the 5x factor illustrates how configuration variation within these models exceeds default-based inter-model spreads, while noting that a full 108-configuration sweep on all 26 models is computationally prohibitive. This framing preserves the core empirical observation without overgeneralizing. revision: partial

  2. Referee: [Abstract and correlation analysis] Abstract and correlation analysis section: the near-zero Spearman's ρ demonstrating non-transfer of optimal configurations is computed only over the 108 configurations of the three swept models. This leaves open whether the lack of transfer holds when considering the broader 26-model landscape or when comparing across different model families.

    Authors: The Spearman's ρ calculation is intentionally performed over the full 108-configuration space per model to directly test whether optimal decoding settings transfer between benchmarks for the same LLM. Because exhaustive configurations are unavailable for the remaining 23 models, a direct extension is not possible. In the revision we will add a supplementary analysis comparing the relative ranking of the 26 models under their default configurations across VerilogEval and RTLLM; this shows similarly low rank correlation, providing supporting evidence that benchmark-specific effects persist even at the default level. We will clarify the scope of the ρ result in the text while retaining the demonstration that configuration non-transferability is observable within the models studied in depth. revision: partial

  3. Referee: [Methods and results on sweep] Methods and results sections describing the sweep: the manuscript reports specific quantitative outcomes (25.5% gaps, 5x factor) but provides no visible details on exact hyperparameter ranges explored, statistical tests used to establish significance of differences, or error bars on pass rates. These omissions make it impossible to verify that the best/worst settings were not post-hoc selections and that the reported gaps are robust.

    Authors: We agree that these details are essential for reproducibility and verification. The revised Methods section will explicitly enumerate the hyperparameter grid (temperature, top-p, top-k, repetition penalty, and beam size values yielding exactly 108 combinations), describe the exhaustive enumeration procedure used to identify best/worst settings, report bootstrap-derived 95% confidence intervals as error bars on all pass rates, and include the statistical tests (paired Wilcoxon signed-rank tests with Bonferroni correction) used to confirm significance of the observed gaps. These additions will be placed in both the main text and an expanded appendix. revision: yes
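
A minimal sketch of the two statistics named in this response, a bootstrap 95% confidence interval on a pass rate and a paired Wilcoxon signed-rank test with Bonferroni correction, computed on hypothetical per-task pass fractions rather than the paper's data.

```python
# Sketch of the statistics promised in the revision: a bootstrap 95% CI on a
# pass rate and a paired Wilcoxon signed-rank test between two configurations.
# The per-task pass fractions below are hypothetical placeholders.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Per-task pass fraction (passes out of 5 samples) for one model under the
# best- and worst-found configurations on the same benchmark tasks.
per_task_best = rng.binomial(5, 0.55, size=150) / 5
per_task_worst = rng.binomial(5, 0.30, size=150) / 5

# Bootstrap 95% confidence interval for the mean pass rate of the best setting.
boot = [rng.choice(per_task_best, size=per_task_best.size, replace=True).mean()
        for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"pass rate {per_task_best.mean():.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")

# Paired Wilcoxon signed-rank test on per-task differences between settings,
# with a Bonferroni-style correction over the 108 swept configurations.
stat, p = wilcoxon(per_task_best, per_task_worst)
alpha = 0.05 / 108
print(f"Wilcoxon p = {p:.2e}, significant after correction: {p < alpha}")
```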

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or self-referential results

full rationale

The paper is a direct empirical study that measures pass rates for 26 LLMs under default settings and then performs a 108-configuration sweep on three models. All reported quantities (absolute pass-rate gaps, 5x comparison to model spread, Spearman's ρ near zero) are computed from raw benchmark outcomes with synthesis-in-the-loop. No equations, fitted parameters, ansatzes, or uniqueness theorems appear; the central claim is a straightforward comparison of observed variances. The methodology is self-contained against the stated benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the chosen benchmarks validly measure RTL quality and that results from three models generalize. No free parameters are fitted; the study sweeps rather than optimizes. No new entities are postulated.

axioms (2)
  • domain assumption VerilogEval and RTLLM benchmarks with synthesis-in-the-loop accurately reflect practical RTL generation performance.
    Invoked as the core evaluation metric throughout the abstract.
  • domain assumption Hyperparameter sensitivity observed in the three-model sweep applies to the broader set of 26 models.
    The general claim about configuration over selection is extrapolated from the detailed sweep on three models.

pith-pipeline@v0.9.0 · 5515 in / 1525 out tokens · 68784 ms · 2026-05-10T05:55:29.029149+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 unverdicted novelty 3.0

    A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.

  2. LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

    cs.CR 2026-05 accept novelty 2.0

    LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.

Reference graph

Works this paper leans on

21 extracted references · 7 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1] S. Thakur, B. Ahmad et al., "Benchmarking large language models for automated Verilog RTL code generation," 2022. [Online]. Available: https://arxiv.org/abs/2212.11140

  2. [2] Z. Wang, L. Alrahis et al., "LLMs and the future of chip design: Unveiling security risks and building trust," in 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2024, pp. 385–390.

  3. [3] M. Abdelatty, M. Nouh et al., "Pluto: A benchmark for evaluating efficiency of LLM-generated hardware code," arXiv preprint arXiv:2510.14756, 2025.

  4. [4] F. Cui, C. Yin et al., "OriGen: Enhancing RTL code generation with code-to-code augmentation and self-reflection," in Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024, pp. 1–9.

  5. [5] S. Liu, W. Fang et al., "RTLCoder: Fully open-source and efficient LLM-assisted RTL code generation technique," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 4, pp. 1448–1461, 2024.

  6. [6] Z. Wang, M. Shao et al., "VeriLeaky: Navigating IP protection vs. utility in fine-tuning for LLM-driven Verilog coding," in 2025 IEEE International Conference on LLM-Aided Design (ICLAD). IEEE, 2025, pp. 100–107.

  7. [7] Y. Lu et al., "OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation," arXiv preprint arXiv:2404.06117, 2024.

  8. [8] Y. Zhu, J. Li et al., "Hot or cold? Adaptive temperature sampling for code generation with large language models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, 2024, pp. 437–445.

  9. [9] Z. Wang, M. Shao et al., "NetDetox: Adversarial and efficient evasion of hardware-security GNNs via RL-LLM orchestration," arXiv preprint arXiv:2512.00119, 2025.

  10. [10] M. Shao, A. Basit et al., "Survey of different large language model architectures: Trends, benchmarks, and challenges," IEEE Access, vol. 12, pp. 188664–188706, 2024.

  11. [11] M. Chen, J. Tworek et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.

  12. [12] J. Austin, A. Odena et al., "Program synthesis with large language models," arXiv preprint arXiv:2108.07732, 2021.

  13. [13] Y. Lu, S. Liu et al., "RTLLM: An open-source benchmark for design RTL generation with large language model," in 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024, pp. 722–727.

  14. [14] M. Liu, N. Pinckney et al., "VerilogEval: Evaluating large language models for Verilog code generation," in 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2023, pp. 1–8.

  15. [15] S. Thakur, B. Ahmad et al., "VeriGen: A large language model for Verilog code generation," ACM Transactions on Design Automation of Electronic Systems, vol. 29, no. 3, pp. 1–31, 2024.

  16. [16] Y. Zhao, H. Zhang et al., "MAGE: A multi-agent engine for automated RTL code generation," in 2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025, pp. 1–7.

  17. [17] D. Garcia-Gasulla, G. Kestor et al., "TuRTLe: A unified evaluation of LLMs for RTL generation," in 2025 ACM/IEEE 7th Symposium on Machine Learning for CAD (MLCAD). IEEE, 2025, pp. 1–12.

  18. [18] N. Pinckney, C. Batten et al., "Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation," ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 6, pp. 1–20, 2025.

  19. [19] C. Wolf, "Yosys Open SYnthesis Suite," https://yosyshq.net/yosys/, 2013.

  20. [20] Nangate Inc., "The NanGate 45nm Open Cell Library," https://si2.org/, 2008.

  21. [21] W. Fu, Z. Wang et al., "Synthesis-in-the-loop evaluation of LLMs for RTL generation: Quality, reliability, and failure modes," 2026. [Online]. Available: https://arxiv.org/abs/2603.11287