
arxiv: 2604.15606 · v2 · pith:QJIDUJWInew · submitted 2026-04-17 · 💻 cs.AR

Spec2Cov: An Agentic Framework for Code Coverage Closure of Digital Hardware Designs

Pith reviewed 2026-05-22 10:14 UTC · model grok-4.3

classification 💻 cs.AR
keywords hardware verification · code coverage closure · test stimulus generation · agentic workflows · large language models · digital design validation · simulation feedback loop

The pith

An agentic LLM framework generates test stimuli from hardware design specifications and iteratively refines them via simulator feedback to close code coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Spec2Cov to automate the manual and slow process of coverage closure in hardware verification. It demonstrates that large language models can act as agents that produce test cases directly from specifications, interact with simulators, handle errors, and use coverage reports to improve results over multiple rounds. The work evaluates this approach across designs of different sizes and shows it reaches complete coverage on simpler cases while making substantial progress on harder ones. A sympathetic reader would care because this could reduce the time and human effort currently required to validate digital hardware before production.

Core claim

Spec2Cov is an agentic framework that automatically and iteratively generates test stimulus directly from design specifications by coordinating interactions between an LLM and a hardware simulator, managing compilation and simulation errors, parsing coverage reports, and feeding results back to the model for refinement without additional fine-tuning.

What carries the argument

The closed-loop agentic workflow that connects the LLM to the simulator for error management, coverage parsing, and iterative stimulus refinement.
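
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `SimResult`, `close_coverage`, and the `generate` and `simulate` callables are hypothetical names standing in for the LLM call and the simulator invocation, and the iteration budget of 10 is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SimResult:
    """Outcome of one compile-and-simulate attempt (hypothetical structure)."""
    ok: bool               # True if compilation and simulation both succeeded
    log: str               # error text on failure, coverage report text on success
    coverage: float = 0.0  # percent code coverage, meaningful only when ok is True


def close_coverage(
    spec: str,
    generate: Callable[[str, str], str],   # (spec, feedback) -> test stimulus
    simulate: Callable[[str], SimResult],  # stimulus -> SimResult
    target: float = 100.0,
    max_iters: int = 10,                   # assumed budget, not from the paper
) -> Tuple[float, Optional[str]]:
    """Iterate generate -> simulate -> feedback until the target or the budget is hit."""
    best_cov, best_stim = 0.0, None
    feedback = ""  # empty on the first round: the model sees only the specification
    for _ in range(max_iters):
        stimulus = generate(spec, feedback)
        result = simulate(stimulus)
        if not result.ok:
            # Compilation or runtime error: hand the log back for correction.
            feedback = "ERROR:\n" + result.log
            continue
        if result.coverage > best_cov:
            best_cov, best_stim = result.coverage, stimulus
        if best_cov >= target:
            break
        # Successful run below target: hand the coverage report back.
        feedback = "COVERAGE:\n" + result.log
    return best_cov, best_stim
```

Passing the model and the simulator in as plain callables keeps the control flow testable with stubs, which is one reasonable way to separate the loop logic from any particular LLM API or simulator.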

If this is right

  • Verification teams could shift from manual test writing to overseeing automated coverage closure loops.
  • Simpler designs reach 100 percent coverage while complex ones reach up to 49 percent across the evaluated set.
  • Specific framework features improve performance without requiring model retraining.
  • The same loop structure applies to problems drawn from existing benchmark suites of varying complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to mixed-signal or system-level designs if the simulator interface remains stable.
  • Integration with existing constrained-random tools could further boost coverage on the hardest cases.
  • Success on larger designs would depend on how well future models handle longer context from detailed coverage reports.
  • Continuous verification pipelines could adopt the framework to maintain coverage as designs evolve.

Load-bearing premise

The LLM can interpret coverage reports and simulation errors well enough to generate improved test stimulus in later iterations without fine-tuning or human intervention.

What would settle it

A design where multiple iterations produce no measurable increase in coverage metrics despite the model receiving full simulator feedback and error logs each round.
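
One way to make this test concrete is a plateau check over the per-round coverage history. The sketch below is an illustration only: `has_plateaued`, the window of 3 rounds, and the 0.01-point gain threshold are assumptions chosen for the example, not values from the paper.

```python
from typing import Sequence


def has_plateaued(
    coverage_history: Sequence[float],
    window: int = 3,         # assumed number of recent rounds to inspect
    min_gain: float = 0.01,  # assumed minimum improvement, in percentage points
) -> bool:
    """Return True if the last `window` rounds raised best coverage by less than min_gain."""
    if len(coverage_history) <= window:
        return False  # not enough rounds to judge
    best_before = max(coverage_history[:-window])
    best_overall = max(coverage_history)
    return (best_overall - best_before) < min_gain
```

For instance, a history of `[10, 30, 30, 30, 30]` counts as a plateau under these defaults, while `[10, 20, 30, 40, 50]` does not.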

Figures

Figures reproduced from arXiv: 2604.15606 by Alma Babbit, Aman Arora, Elias Hilaneh, Nakul Gopalan, Sean Lowe, Vidya Chhabria.

Figure 1
Figure 1. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]
Figure 2
Figure 2. Detailed flow of the Spec2Cov framework.
Paper excerpt: "… successful generation, the testcase is inserted into an auto-generated testbench template (which instantiates the design and includes clock generation logic). The design and testbench are then passed to the simulator, which performs simulation with coverage metrics enabled. Compilation and runtime errors are fed back to the LLM for correction. Upon successful simulat…"
Figure 3
Figure 3. Agentic approach achieves significantly higher coverage, whereas the single iteration approach fails to produce any coverage in some cases. GM = Geometric Mean.
Paper excerpt: "… best of our knowledge, there is no prior work that addresses automated testcase generation from specifications for closing code coverage for CVDP designs. Consequently, our results cannot be compared against a prior baseline, and we instead report a…"
Figure 5
Figure 5. Average generation time and average total token usage increases as design complexity increases.
Paper excerpt (Section VI, Discussion, "Code Coverage Focus"): "Spec2Cov targets code coverage because it is a standard sign-off gate before broader verification closure, and manual closure remains a major verification bottleneck. Automating this stage provides immediate workflow impact and establishes a foundation for future functional…"
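
Figure 3 summarizes coverage with a geometric mean (GM). The sketch below shows how such a summary can be computed and why a single design with zero coverage drives the GM to zero. The function name and the zero-handling choice are assumptions for illustration; how the paper treats zero-coverage cases is not stated in the captured text.

```python
import math
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of non-negative values; returns 0.0 if any value is zero."""
    if not values:
        raise ValueError("geometric_mean requires at least one value")
    if any(v < 0 for v in values):
        raise ValueError("geometric_mean requires non-negative values")
    if any(v == 0 for v in values):
        return 0.0  # the product is zero, so the mean is zero
    # Average the logs instead of multiplying to avoid overflow on long inputs.
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

Because one zero collapses the result, a GM is a strict summary for a method that sometimes produces no coverage at all.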
Original abstract

Hardware verification is one of the most challenging stages of the hardware design process, requiring significant time and resources to ensure a design is fully validated and production-ready. Verification teams aim to maximize design coverage while ensuring correct behavior and alignment with the specification. Coverage closure, which relies on iterative constrained-random and directed testing, is still largely manual and therefore slow and labor-intensive. Recent advances show that the code generation capabilities of Large Language Models (LLMs) can be integrated with external tools to build agentic workflows that autonomously perform hardware design and verification tasks. In this work, we introduce Spec2Cov, an agentic framework that automatically and iteratively generates test stimulus directly from design specifications to accelerate coverage closure. Spec2Cov coordinates interactions between an LLM and a hardware simulator, managing compilation and simulation errors, parsing coverage reports, and feeding results back to the model for refinement. We present features that improve Spec2Cov's effectiveness without additional fine-tuning and evaluate their impact. Across 26 designs of varying size and complexity, including problems from the CVDP benchmark suite, Spec2Cov demonstrates promising performance, achieving 100% coverage on simpler designs and up to 49% on more complex designs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Spec2Cov, an agentic framework that uses LLMs to automatically and iteratively generate test stimulus from design specifications for hardware verification. It coordinates with a simulator to handle compilation/simulation errors, parse coverage reports, and refine stimuli in a feedback loop without additional fine-tuning. Evaluation across 26 designs of varying size and complexity, including CVDP benchmarks, reports 100% coverage on simpler designs and up to 49% on more complex designs.

Significance. If the central empirical claims hold after addressing evaluation gaps, the work could have practical significance by automating a labor-intensive aspect of hardware verification. The demonstration of an iterative LLM-simulator loop with features to improve effectiveness without fine-tuning represents a relevant direction for integrating agentic AI into EDA workflows.

major comments (2)
  1. [Evaluation] Evaluation section: The central claim of 'promising performance' and acceleration of coverage closure rests on coverage numbers (100% on simple designs, up to 49% on complex ones) across 26 designs but supplies no baseline comparisons to conventional methods such as constrained-random testing or standard EDA flows on the same designs, simulators, and metrics. This omission is load-bearing, as the observed results could reflect design simplicity or default simulator behavior rather than the contribution of the agentic feedback loop.
  2. [Results] Results and discussion: The manuscript provides no statistical details, failure analysis, or exact coverage distributions for the 26 designs, leaving the 'promising performance' assertion under-supported and difficult to interpret or reproduce.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'up to 49%' should be supplemented with more precise metrics (e.g., mean coverage, variance, or per-design breakdown) to strengthen the empirical summary.
  2. [Framework] Framework description: Additional details on the specific LLM model, prompt templates, and exact parsing logic for coverage reports would improve reproducibility and clarity.
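
The per-design breakdown requested in the first minor comment could be produced with a small summary routine like the one sketched here. Everything in it is illustrative: `summarize_coverage` is a hypothetical helper, and the design names and coverage values in the usage note are placeholders, not results from the paper.

```python
import statistics
from typing import Dict


def summarize_coverage(per_design: Dict[str, float]) -> Dict[str, float]:
    """Summarize per-design coverage percentages with basic descriptive statistics."""
    values = list(per_design.values())
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Sample standard deviation is undefined for a single value.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
        "fully_covered": sum(1 for v in values if v >= 100.0),
    }
```

Called on placeholder data such as `{"design_a": 100.0, "design_b": 80.0, "design_c": 60.0}`, it reports a mean and median of 80.0, a sample standard deviation of 20.0, and one fully covered design.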

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and positive assessment of the potential impact of our work. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where appropriate.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The central claim of 'promising performance' and acceleration of coverage closure rests on coverage numbers (100% on simple designs, up to 49% on complex ones) across 26 designs but supplies no baseline comparisons to conventional methods such as constrained-random testing or standard EDA flows on the same designs, simulators, and metrics. This omission is load-bearing, as the observed results could reflect design simplicity or default simulator behavior rather than the contribution of the agentic feedback loop.

    Authors: We concur that baseline comparisons are essential to substantiate the claims regarding the acceleration of coverage closure. In the revised version, we will include results from constrained-random testing on the same set of designs and using the identical simulator setup. For the simpler designs where we achieve 100% coverage, we will demonstrate that random testing alone does not reach full coverage within comparable simulation budgets. For complex designs, we will report the coverage achieved by standard methods to highlight the relative improvement from the agentic approach. We note that implementing full standard EDA flows may be beyond the scope, but these additions will address the core concern. revision: yes

  2. Referee: [Results] Results and discussion: The manuscript provides no statistical details, failure analysis, or exact coverage distributions for the 26 designs, leaving the 'promising performance' assertion under-supported and difficult to interpret or reproduce.

    Authors: We acknowledge this limitation in the current manuscript. We will expand the results section to include exact coverage percentages for each of the 26 designs, along with any statistical measures such as averages over multiple runs if applicable. A new subsection on failure analysis will be added, discussing the designs where coverage fell short of 100%, including factors like design complexity, specification ambiguity, or LLM limitations in generating effective stimuli. This will provide better support for our assertions and aid reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework demonstration with results grounded in external simulator outcomes

full rationale

The paper introduces Spec2Cov as an agentic LLM-simulator workflow and reports coverage results across 26 designs. No mathematical derivations, equations, fitted parameters, or self-referential predictions appear. Coverage numbers are produced by running the framework against a hardware simulator and parsing its reports; these outcomes are independent of any internal definition or self-citation chain. The evaluation is therefore self-contained against external benchmarks (simulator behavior on the given designs) and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven assumption that current LLMs possess sufficient zero-shot reasoning to generate valid hardware test stimulus and usefully refine it from simulator feedback without fine-tuning.

axioms (1)
  • domain assumption: Large language models can parse hardware design specifications and generate syntactically and semantically correct test stimulus that a simulator can execute.
    This capability is required for the agent to produce initial tests and respond to coverage feedback.
invented entities (1)
  • Spec2Cov agentic loop (no independent evidence)
    purpose: To manage iterative interactions between LLM and hardware simulator for coverage closure.
    The framework itself is the primary contribution; its effectiveness is shown only through the reported runs rather than independent validation.

pith-pipeline@v0.9.0 · 5757 in / 1312 out tokens · 45846 ms · 2026-05-22T10:14:52.750833+00:00 · methodology



Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

  1. [1] J. Blocklove, S. Garg, R. Karri, and H. Pearce, “Chip-Chat: Challenges and Opportunities in Conversational Hardware Design,” in 2023 ACM/IEEE 5th Workshop on Machine Learning for CAD (MLCAD), Sep. 2023, pp. 1–6.

  2. [2] R. Ma, Y. Yang, Z. Liu, J. Zhang, M. Li, J. Huang, and G. Luo, “VerilogReader: LLM-Aided Hardware Test Generation,” in 2024 IEEE LLM Aided Design Workshop (LAD), Jun. 2024, pp. 1–5.

  3. [3] R. Qiu, G. L. Zhang, R. Drechsler, U. Schlichtmann, and B. Li, “AutoBench: Automatic Testbench Generation and Evaluation Using LLMs for HDL Design,” in Proceedings of the 2024 ACM/IEEE International Symposium on Machine Learning for CAD, ser. MLCAD ’24. New York, NY, USA: Association for Computing Machinery, Sep. 2024, pp. 1–10.

  4. [4] Z. Zhang, G. Chadwick, H. McNally, Y. Zhao, and R. Mullins, “LLM4DV: Using Large Language Models for Hardware Test Stimuli Generation,” Oct. 2023.

  5. [5] B. Mali, K. Maddala, V. Gupta, S. Reddy, C. Karfa, and R. Karri, “ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation,” in 2024 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). Knoxville, TN, USA: IEEE, Jul. 2024, pp. 680–683.

  6. [6] W. Fang, M. Li, M. Li, Z. Yan, S. Liu, H. Zhang, and Z. Xie, “AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs,” in 2024 IEEE LLM Aided Design Workshop (LAD), Jun. 2024, pp. 1–1.

  7. [7] “Illm4dv: Using large language models for hardware test stimuli generation.”

  8. [8] M. Hassan, M. Nadeem, K. Qayyum, C. K. Jha, and R. Drechsler, “Prompt. Verify. Repeat. LLMs in the Hardware Verification Cycle,” in 2025 IEEE International Conference on Omni-layer Intelligent Systems (COINS), 2025, pp. 1–6.

  9. [9] J. Ye, Y. Hu, K. Xu, D. Pan, Q. Chen, J. Zhou, S. Zhao, X. Fang, X. Wang, N. Guan, and Z. Jiang, “From Concept to Practice: an Automated LLM-aided UVM Machine for RTL Verification,” 2025. [Online]. Available: https://arxiv.org/abs/2504.19959

  10. [10] N. Pinckney, C. Deng, C.-T. Ho, Y.-D. Tsai, M. Liu, W. Zhou, B. Khailany, and H. Ren, “Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification,” 2025. [Online]. Available: https://arxiv.org/abs/2506.14074

  11. [11] Anonymous, “Spec2Cov,” 2025, https://anonymous.4open.science/r/spec2cov

  12. [12] S. Thakur, B. Ahmad, H. Pearce, B. Tan, B. Dolan-Gavitt, R. Karri, and S. Garg, “VeriGen: A Large Language Model for Verilog Code Generation,” ACM Transactions on Design Automation of Electronic Systems, p. 3643681, Feb. 2024.

  13. [13] Y. Fu, Y. Zhang, Z. Yu, S. Li, Z. Ye, C. Li, C. Wan, and Y. C. Lin, “GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models,” 2025. [Online]. Available: https://arxiv.org/abs/2309.10730

  14. [14] P. E. Calzada, Z. Ibnat, T. Rahman, K. Kandula, D. Lu, S. K. Saha, F. Farahmandi, and M. Tehranipoor, “VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation,” 2025. [Online]. Available: https://arxiv.org/abs/2507.13369

  15. [15] K. Xu, J. Sun, Y. Hu, X. Fang, W. Shan, X. Wang, and Z. Jiang, MEIC: Re-thinking RTL Debug Automation using LLMs. New York, NY, USA: Association for Computing Machinery, 2025. [Online]. Available: https://doi-org.ezproxy1.lib.asu.edu/10.1145/3676536.3676801

  16. [16] Y. Hu, J. Ye, K. Xu, J. Sun, S. Zhang, X. Jiao, D. Pan, J. Zhou, N. Wang, W. Shan, X. Fang, X. Wang, N. Guan, and Z. Jiang, “UVLLM: An Automated Universal RTL Verification Framework using LLMs,” 2024. [Online]. Available: https://arxiv.org/abs/2411.16238

  17. [17] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

  18. [18] J. Strömbergson, “vndecorrelator: Verilog implementation of a von Neumann decorrelator,” https://github.com/secworks/vndecorrelator, 2016.

  19. [19] A. Vashist, “FIFO SystemVerilog Assertion: Synchronous FIFO with SystemVerilog Assertions,” https://github.com/avashist003/FIFO SystemVerilog Assertion, 2020.

  20. [20] J. Strömbergson, “uart: Verilog implementation of a simple UART core,” https://github.com/secworks/uart, 2014.

  21. [21] J. Strömbergson, “sha1: Verilog implementation of the SHA-1 hash function,” https://github.com/secworks/sha1, 2014.

  22. [22] J. Strömbergson, “chacha: Verilog implementation of the ChaCha stream cipher,” https://github.com/secworks/chacha, 2014.

  23. [23] J. Strömbergson, “trng: True Random Number Generator core implemented in Verilog,” https://github.com/secworks/trng, 2014.

  24. [24] M. Czerski, “SD-card-controller: SD/SDHC card controller for Wishbone bus,” https://github.com/mczerski/SD-card-controller, 2013.

  25. [25] S. Mehta, “DSP Slice: Floating Point Units,” https://github.com/samidhm/DSP Slice/tree/main/Floating Point Units, 2020.

  26. [26] V. Patel and UT-LCA, “tpu like design: TPU-like design with pooling unit,” https://github.com/UT-LCA/tpu like design/tree/master/design ws vedant, 2019.