Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification
Pith reviewed 2026-05-08 09:50 UTC · model grok-4.3
The pith
Knowledge graphs supply traceable links between natural-language specs, RTL designs, and tool feedback so LLM agents can generate and refine SystemVerilog assertions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A verification-centric knowledge graph, constructed from structured intermediate representations extracted from natural-language specifications, RTL, and formal-tool feedback, supplies traceable context to a multi-agent LLM workflow. The workflow generates SystemVerilog Assertions and executes syntax-repair, counterexample-correction, and coverage-augmentation loops, improving specification-to-RTL grounding and producing assertions with formal coverage ranging from 78.5 to 99.4 percent.
What carries the argument
A verification-centric Knowledge Graph, built from intermediate representations and linking specifications, RTL hierarchy, signals, assumptions, and properties, which a multi-agent workflow queries and updates to drive three refinement loops.
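The paper's KG links requirements, design hierarchy, signals, assumptions, and properties. As a minimal sketch of that idea (the node kinds, edge labels, and API below are hypothetical illustrations, not the paper's schema), such a graph can be modeled as a typed adjacency store that agents query for traceable context:

```python
from collections import defaultdict

class VerificationKG:
    """Toy verification-centric knowledge graph. Node kinds and edge
    labels are illustrative assumptions, not the paper's actual schema."""

    def __init__(self):
        self.nodes = {}                 # node id -> (kind, payload)
        self.edges = defaultdict(list)  # src id -> [(label, dst id)]

    def add_node(self, nid, kind, payload):
        self.nodes[nid] = (kind, payload)

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def trace(self, nid, label):
        """Follow one edge label from a node, e.g. requirement -> properties."""
        return [dst for (lab, dst) in self.edges[nid] if lab == label]

# Hypothetical content: one requirement traced to a signal and a property.
kg = VerificationKG()
kg.add_node("req1", "requirement", "FIFO must not overflow")
kg.add_node("sig_full", "signal", "full")
kg.add_node("p1", "property", "assert property (!(wr_en && full));")
kg.add_edge("req1", "constrains", "sig_full")
kg.add_edge("req1", "verified_by", "p1")

# An agent asking "which properties cover this requirement?"
print(kg.trace("req1", "verified_by"))   # ['p1']
```

Traceable links of this kind are what would let a CEX-guided loop walk from a failing property back to the requirement and signals it touches.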
If this is right
- KG-based retrieval reduces syntax failures and repair overhead during assertion generation.
- The workflow produces consistently compilable SVAs that achieve formal coverage from 78.5 to 99.4 percent across the tested designs.
- Traceable links in the graph enable targeted counterexample correction and coverage-directed property augmentation.
- Convergence speed and final coverage still depend on design complexity, with temporal and arithmetic reasoning remaining difficult for current LLMs.
Where Pith is reading between the lines
- The same graph structure could be reused to guide other agentic tasks such as automatic test generation or bug localization in the same design.
- Design-specific performance variation suggests that adding domain-specific arithmetic reasoning modules to the agents might further raise coverage on complex blocks.
- If the KG construction step proves reliable across more designs, the method could shorten the time from English requirements to provable properties in hardware projects.
Load-bearing premise
Intermediate representations from specifications, RTL, and tool feedback can be turned into a knowledge graph that supplies enough traceable context for LLM agents to perform complex temporal and arithmetic reasoning.
What would settle it
Running the same seven benchmarks with the identical multi-agent workflow but without the knowledge graph, then measuring whether syntax-repair overhead rises and whether formal coverage falls below the reported 78.5–99.4 percent range.
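That control experiment can be sketched as a small harness that runs one workflow under two context providers and reports deltas; every function, design name, and metric here is a toy stand-in, not the paper's tooling:

```python
# Hedged sketch of the proposed ablation: identical workflow, two context
# modes ("kg" vs "flat"). All providers and numbers are hypothetical.

def run_workflow(benchmarks, get_context, generate_and_refine):
    results = {}
    for design in benchmarks:
        ctx = get_context(design)
        coverage, repair_iters = generate_and_refine(design, ctx)
        results[design] = {"coverage": coverage, "repairs": repair_iters}
    return results

def compare(benchmarks, kg_ctx, flat_ctx, engine):
    kg = run_workflow(benchmarks, kg_ctx, engine)
    flat = run_workflow(benchmarks, flat_ctx, engine)
    return {d: {"cov_delta": kg[d]["coverage"] - flat[d]["coverage"],
                "repair_delta": flat[d]["repairs"] - kg[d]["repairs"]}
            for d in benchmarks}

# Toy numbers standing in for (coverage, syntax-repair iterations).
designs = ["fifo", "alu"]
toy = {"kg":   {"fifo": (0.99, 1), "alu": (0.80, 2)},
       "flat": {"fifo": (0.90, 4), "alu": (0.70, 6)}}
engine = lambda design, ctx: toy[ctx][design]

deltas = compare(designs, lambda d: "kg", lambda d: "flat", engine)
print(deltas["fifo"])   # repair_delta is 3; cov_delta is ~0.09
```

A positive `cov_delta` and `repair_delta` across all seven benchmarks would attribute the gains to the KG rather than to the iterative loops alone.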
Original abstract
Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification-to-RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification-centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal-tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design-grounded context for generation. A multi-agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX-guided correction using trace links, and coverage-directed property augmentation. Evaluation across seven benchmark designs indicates that KG-based context retrieval improves specification-to-RTL grounding and consistently produces compilable SVAs with low syntax-repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a verification-centric Knowledge Graph (KG) constructed from structured intermediate representations (IRs) extracted from natural-language specifications, RTL code, and formal-tool feedback (syntax diagnostics, CEXs, coverage reports). A multi-agent LLM workflow queries and updates this KG to synthesize SystemVerilog Assertions (SVAs) and drives three refinement loops: syntax repair, CEX-guided correction, and coverage-directed augmentation. Evaluation on seven benchmark designs reports formal coverage of 78.5–99.4% with low syntax-repair overhead, claiming that KG-based context retrieval improves specification-to-RTL grounding over text-only approaches.
Significance. If the central claim can be isolated from confounding factors, the work offers a practical engineering contribution to agentic AI for formal verification by demonstrating how a traceable KG can supply design-grounded context to LLM agents, potentially reducing semantic mismatches and repair iterations in SVA generation. The empirical results on multiple benchmarks highlight both the promise and the remaining challenges with complex temporal/arithmetic reasoning.
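As a hedged illustration of the three refinement loops the report describes, the control flow might look like the following, with every tool and agent call stubbed out (function names and signatures are assumptions, not the paper's API):

```python
# Toy driver for the three refinement loops: syntax repair, CEX-guided
# correction, and coverage-directed augmentation. All callbacks are stubs.

def run_refinement(sva, compile_fn, prove_fn, coverage_fn,
                   repair_fn, correct_fn, augment_fn,
                   max_iters=5, target_cov=0.95):
    history = []
    for _ in range(max_iters):
        ok, diag = compile_fn(sva)
        if not ok:                       # loop 1: syntax repair
            sva = repair_fn(sva, diag)
            history.append("repair")
            continue
        proved, cex = prove_fn(sva)
        if not proved:                   # loop 2: CEX-guided correction
            sva = correct_fn(sva, cex)
            history.append("correct")
            continue
        cov = coverage_fn(sva)
        if cov < target_cov:             # loop 3: coverage augmentation
            sva = augment_fn(sva, cov)
            history.append("augment")
            continue
        return sva, cov, history
    return sva, coverage_fn(sva), history

# Toy demo: first pass fails compilation, repair fixes it, proof and
# coverage then succeed on the second pass.
sva, cov, hist = run_refinement(
    "assert_p",
    compile_fn=lambda s: ("fixed" in s, "syntax error"),
    prove_fn=lambda s: (True, None),
    coverage_fn=lambda s: 0.99,
    repair_fn=lambda s, diag: s + " fixed",
    correct_fn=lambda s, cex: s,
    augment_fn=lambda s, c: s,
)
print(hist)   # ['repair']
```

The KG's role in the paper is to supply the context each callback would need, e.g. trace links from a CEX back to the requirement a property encodes.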
major comments (3)
- [Evaluation] Evaluation section: the reported coverage range (78.5–99.4%) and low syntax-repair overhead are presented without any ablation that disables KG queries while retaining the same multi-agent loops, IR extraction, and refinement mechanisms (or substitutes flat-text concatenation). This prevents attribution of gains specifically to the KG structure rather than iterative feedback or prompting.
- [Evaluation] The manuscript acknowledges design dependence and difficulty with complex temporal/arithmetic reasoning but provides no quantitative breakdown of per-design variance, failure modes, or statistical details (e.g., run-to-run variance, number of trials, or confidence intervals) for the coverage metrics.
- [Approach and Evaluation] The central claim that KG-based retrieval 'improves specification-to-RTL grounding' rests on its weakest assumption: that IRs can be reliably structured into a KG supplying sufficient traceable context. No evidence is given that the KG construction itself is robust or that alternative context mechanisms were tested.
minor comments (2)
- [Abstract and Evaluation] The abstract and introduction use 'design dependence' without defining the specific designs or metrics used to quantify it; a table or subsection clarifying per-benchmark convergence behavior would improve clarity.
- [Approach] Notation for the KG nodes/edges (requirements, signals, properties) is introduced informally; an early figure or table defining the schema would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the evaluation and clarify the contribution of the knowledge graph. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the reported coverage range (78.5–99.4%) and low syntax-repair overhead are presented without any ablation that disables KG queries while retaining the same multi-agent loops, IR extraction, and refinement mechanisms (or substitutes flat-text concatenation). This prevents attribution of gains specifically to the KG structure rather than iterative feedback or prompting.
Authors: We agree that an explicit ablation isolating the KG's role is necessary to attribute performance gains specifically to the structured retrieval mechanism. In the revised manuscript we will add a new ablation experiment that retains the identical multi-agent workflow, IR extraction pipeline, and three refinement loops but replaces KG queries with flat-text concatenation of the same extracted IRs. We will report the resulting formal coverage and syntax-repair overhead for direct comparison with the KG-based results. This addition will be placed in the Evaluation section alongside the existing benchmark tables. revision: yes
Referee: [Evaluation] The manuscript acknowledges design dependence and difficulty with complex temporal/arithmetic reasoning but provides no quantitative breakdown of per-design variance, failure modes, or statistical details (e.g., run-to-run variance, number of trials, or confidence intervals) for the coverage metrics.
Authors: We acknowledge the value of more granular statistical reporting. The current results reflect single-run executions per design, driven by the high cost of repeated LLM calls and formal-tool invocations. In revision we will expand the Evaluation section with a per-design table that includes individual coverage percentages, a categorized list of observed failure modes (e.g., incomplete temporal property coverage on arithmetic-heavy modules), and any run-to-run variance observed in preliminary multi-trial tests on two of the smaller benchmarks. We will also add a brief discussion of confidence-interval estimation and note the practical constraints on exhaustive statistical sampling. revision: partial
Referee: [Approach and Evaluation] The central claim that KG-based retrieval 'improves specification-to-RTL grounding' rests on the weakest assumption that IRs can be reliably structured into a KG supplying sufficient traceable context; no evidence is given that the KG construction itself is robust or that alternative context mechanisms were tested.
Authors: We accept that the manuscript would benefit from greater transparency on KG construction robustness and from explicit comparison language. Section 3 already details the IR extraction rules and entity-linking procedure; we will augment this with a new subsection on potential failure modes (LLM parsing inaccuracies, incomplete signal extraction) and the mitigation strategies employed by the update loops. While we did not evaluate alternative context stores (e.g., vector databases), the traceable links are essential for the CEX-guided and coverage-directed refinement loops described in Section 4. We will revise the abstract and introduction to frame the improvement as demonstrated through the integrated workflow rather than as a standalone claim about KG superiority, and we will add a limitations paragraph on the scope of the current evidence. revision: yes
Circularity Check
Empirical engineering pipeline with no derivation chain or self-referential reductions
Full rationale
The paper presents a multi-agent workflow that builds a KG from IRs extracted from natural-language specs, RTL, and tool feedback (syntax diagnostics, CEXs, coverage reports), then uses it to generate and refine SVAs via three loops. Evaluation on seven benchmarks reports coverage 78.5–99.4% and low syntax-repair overhead. No equations, fitted parameters, or first-principles predictions appear; claims rest directly on experimental outcomes. No self-citations are used as load-bearing uniqueness theorems or to smuggle ansatzes. The work is self-contained and falsifiable via replication on the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Structured IRs can be extracted from specifications, RTL, and formal-tool feedback to form a useful verification-centric KG
- domain assumption LLM agents can effectively query and update the KG to generate and refine SVAs including complex temporal reasoning
invented entities (1)
- Verification-centric Knowledge Graph (no independent evidence)
Forward citations
Cited by 2 Pith papers
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges — A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges — LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.
Reference graph
Works this paper leans on
- [1] R. Kande et al., "(Security) Assertions by Large Language Models," IEEE Transactions on Information Forensics and Security, 2024.
- [2] Z. Yan et al., "AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs," in Proceedings of the 30th ASPDAC, ACM, 2025.
- [3] B. Mali et al., "ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation," in IEEE ISVLSI, 2024.
- [4] A. Kumar et al., "Saarthi: The First AI Formal Verification Engineer," in DVCon US, 2025.
- [5] D. N. Gadde et al., "Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification," in 38th SBCCI, 2025.
- [6] Y. Bai et al., "AssertionForge: Enhancing Formal Verification Assertion Generation with Structured Representation of Specifications and RTL," in IEEE ICLAD, 2025.
- [7] F. Wu et al., "Spec2Assertion: Automatic Pre-RTL Assertion Generation using LLMs with Progressive Regularization," arXiv, 2025.
- [8] A. Ayalasomayajula et al., "LASP: LLM Assisted Security Property Generation for SoC Verification," in ACM/IEEE 6th MLCAD, 2024.
- [9] H. Lyu et al., "AssertMiner: Module-Level Spec Generation and Assertion Mining using Static Analysis Guided LLMs," arXiv, 2025.
- [10] H. A. Quddus et al., "Enhanced VLSI Assertion Generation: Conforming to High-Level Specifications and Reducing LLM Hallucinations with RAG," in DVCon Europe, 2024.
- [11] S. Paul et al., "LISA: LLM Informed Systemverilog Assertion generation with RAG and Chain-of-Thought," in IEEE ISVLSI, 2025.
- [12] V. Pulavarthi et al., "Are LLMs Ready for Practical Adoption for Assertion Generation?" in DATE Conference, 2025.
- [13] K. Maddala et al., "LAAG-RV: LLM Assisted Assertion Generation for RTL Design Verification," in IEEE 8th ITC India, 2024.
- [14] OpenCores, https://opencores.org
- [15] N. Pinckney et al., "CVDP: A Next-Generation Benchmark Dataset for Evaluating LLMs and Agents on RTL Design and Verification," 2025.