Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs
Pith reviewed 2026-05-22 10:14 UTC · model grok-4.3
The pith
An agentic LLM framework generates test stimuli from hardware design specifications and iteratively refines them via simulator feedback to close code coverage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Spec2Cov is an agentic framework that automatically and iteratively generates test stimulus directly from design specifications by coordinating interactions between an LLM and a hardware simulator, managing compilation and simulation errors, parsing coverage reports, and feeding results back to the model for refinement without additional fine-tuning.
What carries the argument
The closed-loop agentic workflow that connects the LLM to the simulator for error management, coverage parsing, and iterative stimulus refinement.
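A minimal sketch of what such a closed loop could look like, under stated assumptions: `generate_stimulus` and `run_simulator` are hypothetical callables standing in for the LLM call and the simulator invocation, and the `SimResult` fields are illustrative. None of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SimResult:
    """Hypothetical outcome of one compile-and-simulate run."""
    ok: bool               # False if compilation or simulation failed
    log: str               # error log on failure, coverage report text on success
    coverage: float = 0.0  # percent of code covered, 0..100


def coverage_loop(
    spec: str,
    generate_stimulus: Callable[[str, Optional[str]], str],
    run_simulator: Callable[[str], SimResult],
    target: float = 100.0,
    max_iters: int = 10,
) -> Tuple[float, int]:
    """Iterate LLM -> simulator -> feedback until the target or the budget is hit.

    Returns (best coverage seen, iterations used).
    """
    feedback: Optional[str] = None  # first round sees only the specification
    best = 0.0
    for iteration in range(1, max_iters + 1):
        stimulus = generate_stimulus(spec, feedback)
        result = run_simulator(stimulus)
        if result.ok:
            best = max(best, result.coverage)
            if best >= target:
                return best, iteration
            # Successful run: hand the coverage report back for refinement.
            feedback = "Coverage report:\n" + result.log
        else:
            # Failed run: hand the compile or simulation errors back instead.
            feedback = "Fix these errors:\n" + result.log
    return best, max_iters
```

The two feedback branches mirror the two kinds of simulator output the summary mentions: error logs when a run fails, and coverage reports when it succeeds.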
If this is right
- Verification teams could shift from manual test writing to overseeing automated coverage closure loops.
- Across the 26 evaluated designs, simpler ones reach 100 percent coverage while more complex ones reach up to 49 percent.
- Specific framework features improve performance without requiring model retraining.
- The same loop structure applies to problems drawn from existing benchmark suites of varying complexity.
Where Pith is reading between the lines
- The approach may extend to mixed-signal or system-level designs if the simulator interface remains stable.
- Integration with existing constrained-random tools could further boost coverage on the hardest cases.
- Success on larger designs would depend on how well future models handle longer context from detailed coverage reports.
- Continuous verification pipelines could adopt the framework to maintain coverage as designs evolve.
Load-bearing premise
The LLM can interpret coverage reports and simulation errors well enough to generate improved test stimulus in later iterations without fine-tuning or human intervention.
What would settle it
A design where multiple iterations produce no measurable increase in coverage metrics despite the model receiving full simulator feedback and error logs each round.
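One way to make "no measurable increase" concrete is a stall check over the per-iteration coverage history. The sketch below is an illustration only; the `window` and `min_gain` thresholds are assumptions chosen for the example, not values from the paper.

```python
from typing import Sequence


def has_stalled(history: Sequence[float], window: int = 3, min_gain: float = 0.1) -> bool:
    """Return True if coverage rose by less than `min_gain` points
    over the last `window` iterations of `history` (percent values)."""
    if len(history) <= window:
        return False  # too few rounds to judge
    return history[-1] - history[-1 - window] < min_gain
```

A design whose history keeps returning True here, despite full feedback each round, would be the kind of counterexample described above.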
Original abstract
Hardware verification is one of the most challenging stages of the hardware design process, requiring significant time and resources to ensure a design is fully validated and production-ready. Verification teams aim to maximize design coverage while ensuring correct behavior and alignment with the specification. Coverage closure, which relies on iterative constrained-random and directed testing, is still largely manual and therefore slow and labor-intensive. Recent advances show that the code generation capabilities of Large Language Models (LLMs) can be integrated with external tools to build agentic workflows that autonomously perform hardware design and verification tasks. In this work, we introduce Spec2Cov, an agentic framework that automatically and iteratively generates test stimulus directly from design specifications to accelerate coverage closure. Spec2Cov coordinates interactions between an LLM and a hardware simulator, managing compilation and simulation errors, parsing coverage reports, and feeding results back to the model for refinement. We present features that improve Spec2Cov's effectiveness without additional fine-tuning and evaluate their impact. Across 26 designs of varying size and complexity, including problems from the CVDP benchmark suite, Spec2Cov demonstrates promising performance, achieving 100% coverage on simpler designs and up to 49% on more complex designs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Spec2Cov, an agentic framework that uses LLMs to automatically and iteratively generate test stimulus from design specifications for hardware verification. It coordinates with a simulator to handle compilation/simulation errors, parse coverage reports, and refine stimuli in a feedback loop without additional fine-tuning. Evaluation across 26 designs of varying size and complexity, including CVDP benchmarks, reports 100% coverage on simpler designs and up to 49% on more complex designs.
Significance. If the central empirical claims hold after addressing evaluation gaps, the work could have practical significance by automating a labor-intensive aspect of hardware verification. The demonstration of an iterative LLM-simulator loop with features to improve effectiveness without fine-tuning represents a relevant direction for integrating agentic AI into EDA workflows.
major comments (2)
- [Evaluation] Evaluation section: The central claim of 'promising performance' and acceleration of coverage closure rests on coverage numbers (100% on simple designs, up to 49% on complex ones) across 26 designs but supplies no baseline comparisons to conventional methods such as constrained-random testing or standard EDA flows on the same designs, simulators, and metrics. This omission is load-bearing, as the observed results could reflect design simplicity or default simulator behavior rather than the contribution of the agentic feedback loop.
- [Results] Results and discussion: The manuscript provides no statistical details, failure analysis, or exact coverage distributions for the 26 designs, leaving the 'promising performance' assertion under-supported and difficult to interpret or reproduce.
minor comments (2)
- [Abstract] Abstract: The phrase 'up to 49%' should be supplemented with more precise metrics (e.g., mean coverage, variance, or per-design breakdown) to strengthen the empirical summary.
- [Framework] Framework description: Additional details on the specific LLM model, prompt templates, and exact parsing logic for coverage reports would improve reproducibility and clarity.
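To illustrate the kind of parsing detail the last comment asks for, here is a small sketch of extracting metrics from a text coverage summary. The report line format (`<metric> : <number>%`) is a hypothetical stand-in; real simulators emit different formats, and the paper's actual parsing logic is not described here.

```python
import re
from typing import Dict

# Hypothetical summary line: "<metric> : <number>%". Real report formats differ.
_METRIC = re.compile(r"^\s*(\w+)\s*:\s*(\d+(?:\.\d+)?)\s*%\s*$")


def parse_coverage(report: str) -> Dict[str, float]:
    """Extract {metric name: percent} from a plain-text coverage summary."""
    metrics: Dict[str, float] = {}
    for line in report.splitlines():
        match = _METRIC.match(line)
        if match:
            metrics[match.group(1).lower()] = float(match.group(2))
    return metrics
```

Lines that do not match the assumed pattern are skipped rather than raising, so headers and unavailable metrics do not interrupt the loop.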
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and positive assessment of the potential impact of our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where appropriate.
- Referee: [Evaluation] Evaluation section: The central claim of 'promising performance' and acceleration of coverage closure rests on coverage numbers (100% on simple designs, up to 49% on complex ones) across 26 designs but supplies no baseline comparisons to conventional methods such as constrained-random testing or standard EDA flows on the same designs, simulators, and metrics. This omission is load-bearing, as the observed results could reflect design simplicity or default simulator behavior rather than the contribution of the agentic feedback loop.
  Authors: We concur that baseline comparisons are essential to substantiate the claims regarding the acceleration of coverage closure. In the revised version, we will include results from constrained-random testing on the same set of designs and using the identical simulator setup. For the simpler designs where we achieve 100% coverage, we will demonstrate that random testing alone does not reach full coverage within comparable simulation budgets. For complex designs, we will report the coverage achieved by standard methods to highlight the relative improvement from the agentic approach. We note that implementing full standard EDA flows may be beyond the scope, but these additions will address the core concern.
  Revision: yes
- Referee: [Results] Results and discussion: The manuscript provides no statistical details, failure analysis, or exact coverage distributions for the 26 designs, leaving the 'promising performance' assertion under-supported and difficult to interpret or reproduce.
  Authors: We acknowledge this limitation in the current manuscript. We will expand the results section to include exact coverage percentages for each of the 26 designs, along with any statistical measures such as averages over multiple runs if applicable. A new subsection on failure analysis will be added, discussing the designs where coverage fell short of 100%, including factors like design complexity, specification ambiguity, or LLM limitations in generating effective stimuli. This will provide better support for our assertions and aid reproducibility.
  Revision: yes
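The per-design reporting discussed in this exchange could be summarized along the following lines. This is a sketch only: the `summarize` helper is hypothetical, and any input values used with it are placeholders, not results from the paper.

```python
from statistics import mean, median, pstdev
from typing import Dict


def summarize(coverage: Dict[str, float]) -> Dict[str, float]:
    """Summarize per-design coverage percentages (design name -> percent)."""
    values = list(coverage.values())
    return {
        "designs": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": pstdev(values),  # population standard deviation
        "min": min(values),
        "full_coverage": sum(v >= 100.0 for v in values),  # designs at 100%
    }
```

Reporting the mean, spread, and count of fully covered designs together gives more context than a single "up to" figure.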
Circularity Check
No circularity: empirical framework demonstration with results grounded in external simulator outcomes
The paper introduces Spec2Cov as an agentic LLM-simulator workflow and reports coverage results across 26 designs. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear. Coverage numbers are produced by running the framework against a hardware simulator and parsing its reports; these outcomes are independent of any internal definition or self-citation chain. The evaluation is therefore self-contained against external benchmarks (simulator behavior on the given designs) and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can parse hardware design specifications and generate syntactically and semantically correct test stimulus that a simulator can execute.
invented entities (1)
- Spec2Cov agentic loop: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Spec2Cov coordinates interactions between an LLM and a hardware simulator, managing compilation and simulation errors, parsing coverage reports, and feeding results back to the model for refinement."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "achieving 100% coverage on simpler designs and up to 49% on more complex designs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.