Knowledge Graphs, the Missing Link in Agentic AI-based Formal Verification
Pith reviewed 2026-05-08 09:50 UTC · model grok-4.3
The pith
Knowledge graphs supply traceable links between natural-language specs, RTL designs, and tool feedback so LLM agents can generate and refine SystemVerilog assertions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A verification-centric knowledge graph, constructed from structured intermediate representations extracted from natural-language specifications, RTL, and formal-tool feedback, supplies traceable context to a multi-agent LLM workflow. The workflow generates SystemVerilog Assertions and executes syntax-repair, counterexample-correction, and coverage-augmentation loops, improving specification-to-RTL grounding and producing assertions with formal coverage ranging from 78.5 to 99.4 percent.
What carries the argument
A verification-centric Knowledge Graph, built from intermediate representations and linking specifications, RTL hierarchy, signals, assumptions, and properties, which a multi-agent workflow queries and updates to drive three refinement loops.
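The paper's KG links requirements, design hierarchy, signals, assumptions, and properties. As a minimal sketch of that idea (the node kinds, edge labels, and API below are hypothetical illustrations, not the paper's schema), such a graph can be modeled as a typed adjacency store that agents query for traceable context:

```python
from collections import defaultdict

class VerificationKG:
    """Toy verification-centric knowledge graph. Node kinds and edge
    labels are illustrative assumptions, not the paper's actual schema."""

    def __init__(self):
        self.nodes = {}                 # node id -> (kind, payload)
        self.edges = defaultdict(list)  # src id -> [(label, dst id)]

    def add_node(self, nid, kind, payload):
        self.nodes[nid] = (kind, payload)

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def trace(self, nid, label):
        """Follow one edge label from a node, e.g. requirement -> properties."""
        return [dst for (lab, dst) in self.edges[nid] if lab == label]

# Hypothetical content: one requirement traced to a signal and a property.
kg = VerificationKG()
kg.add_node("req1", "requirement", "FIFO must not overflow")
kg.add_node("sig_full", "signal", "full")
kg.add_node("p1", "property", "assert property (!(wr_en && full));")
kg.add_edge("req1", "constrains", "sig_full")
kg.add_edge("req1", "verified_by", "p1")

# An agent asking "which properties cover this requirement?"
print(kg.trace("req1", "verified_by"))   # ['p1']
```

Traceable links of this kind are what would let a CEX-guided loop walk from a failing property back to the requirement and signals it touches.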
If this is right
- KG-based retrieval reduces syntax failures and repair overhead during assertion generation.
- The workflow produces consistently compilable SVAs that achieve formal coverage from 78.5 to 99.4 percent across the tested designs.
- Traceable links in the graph enable targeted counterexample correction and coverage-directed property augmentation.
- Convergence speed and final coverage still depend on design complexity, with temporal and arithmetic reasoning remaining difficult for current LLMs.
Where Pith is reading between the lines
- The same graph structure could be reused to guide other agentic tasks such as automatic test generation or bug localization in the same design.
- Design-specific performance variation suggests that adding domain-specific arithmetic reasoning modules to the agents might further raise coverage on complex blocks.
- If the KG construction step proves reliable across more designs, the method could shorten the time from English requirements to provable properties in hardware projects.
Load-bearing premise
Intermediate representations from specifications, RTL, and tool feedback can be turned into a knowledge graph that supplies enough traceable context for LLM agents to perform complex temporal and arithmetic reasoning.
What would settle it
Running the same seven benchmarks with the identical multi-agent workflow but without the knowledge graph, then measuring whether syntax-repair overhead rises and whether formal coverage falls below the reported 78.5–99.4 percent range.
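That control experiment can be sketched as a small harness that runs one workflow under two context providers and reports deltas; every function, design name, and metric here is a toy stand-in, not the paper's tooling:

```python
# Hedged sketch of the proposed ablation: identical workflow, two context
# modes ("kg" vs "flat"). All providers and numbers are hypothetical.

def run_workflow(benchmarks, get_context, generate_and_refine):
    results = {}
    for design in benchmarks:
        ctx = get_context(design)
        coverage, repair_iters = generate_and_refine(design, ctx)
        results[design] = {"coverage": coverage, "repairs": repair_iters}
    return results

def compare(benchmarks, kg_ctx, flat_ctx, engine):
    kg = run_workflow(benchmarks, kg_ctx, engine)
    flat = run_workflow(benchmarks, flat_ctx, engine)
    return {d: {"cov_delta": kg[d]["coverage"] - flat[d]["coverage"],
                "repair_delta": flat[d]["repairs"] - kg[d]["repairs"]}
            for d in benchmarks}

# Toy numbers standing in for (coverage, syntax-repair iterations).
designs = ["fifo", "alu"]
toy = {"kg":   {"fifo": (0.99, 1), "alu": (0.80, 2)},
       "flat": {"fifo": (0.90, 4), "alu": (0.70, 6)}}
engine = lambda design, ctx: toy[ctx][design]

deltas = compare(designs, lambda d: "kg", lambda d: "flat", engine)
print(deltas["fifo"])   # repair_delta is 3; cov_delta is ~0.09
```

A positive `cov_delta` and `repair_delta` across all seven benchmarks would attribute the gains to the KG rather than to the iterative loops alone.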
Original abstract
Recent advances in Large Language Models (LLMs) have enabled workflows that generate SystemVerilog Assertions (SVAs) from natural-language specifications, with the potential to accelerate Formal Verification (FV). However, high-quality assertion synthesis remains challenging because specifications are often ambiguous or incomplete and critical micro-architectural details reside in the Register Transfer Level (RTL). Many existing approaches treat the specification and RTL as loosely structured text, which weakens specification-to-RTL grounding and leads to semantic mismatches and frequent syntax failures during formal parsing and elaboration. This work addresses these limitations with a verification-centric Knowledge Graph (KG) constructed from structured Intermediate Representations (IRs) extracted from the specification, RTL, and formal-tool feedback, including syntax diagnostics, Counterexamples (CEXs), and coverage reports. The KG links requirements, design hierarchy, signals, assumptions, and properties to provide traceable, design-grounded context for generation. A multi-agent workflow queries and updates this KG to generate SVAs and to drive three refinement loops: syntax repair guided by tool diagnostics, CEX-guided correction using trace links, and coverage-directed property augmentation. Evaluation across seven benchmark designs indicates that KG-based context retrieval improves specification-to-RTL grounding and consistently produces compilable SVAs with low syntax-repair overhead. The approach achieves formal coverage ranging from 78.5% to 99.4%, though convergence exhibits design dependence with complex temporal and arithmetic reasoning remaining challenging for current LLM capabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a verification-centric Knowledge Graph (KG) constructed from structured intermediate representations (IRs) extracted from natural-language specifications, RTL code, and formal-tool feedback (syntax diagnostics, CEXs, coverage reports). A multi-agent LLM workflow queries and updates this KG to synthesize SystemVerilog Assertions (SVAs) and drives three refinement loops: syntax repair, CEX-guided correction, and coverage-directed augmentation. Evaluation on seven benchmark designs reports formal coverage of 78.5–99.4% with low syntax-repair overhead, claiming that KG-based context retrieval improves specification-to-RTL grounding over text-only approaches.
Significance. If the central claim can be isolated from confounding factors, the work offers a practical engineering contribution to agentic AI for formal verification by demonstrating how a traceable KG can supply design-grounded context to LLM agents, potentially reducing semantic mismatches and repair iterations in SVA generation. The empirical results on multiple benchmarks highlight both the promise and the remaining challenges with complex temporal/arithmetic reasoning.
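As a hedged illustration of the three refinement loops the report describes, the control flow might look like the following, with every tool and agent call stubbed out (function names and signatures are assumptions, not the paper's API):

```python
# Toy driver for the three refinement loops: syntax repair, CEX-guided
# correction, and coverage-directed augmentation. All callbacks are stubs.

def run_refinement(sva, compile_fn, prove_fn, coverage_fn,
                   repair_fn, correct_fn, augment_fn,
                   max_iters=5, target_cov=0.95):
    history = []
    for _ in range(max_iters):
        ok, diag = compile_fn(sva)
        if not ok:                       # loop 1: syntax repair
            sva = repair_fn(sva, diag)
            history.append("repair")
            continue
        proved, cex = prove_fn(sva)
        if not proved:                   # loop 2: CEX-guided correction
            sva = correct_fn(sva, cex)
            history.append("correct")
            continue
        cov = coverage_fn(sva)
        if cov < target_cov:             # loop 3: coverage augmentation
            sva = augment_fn(sva, cov)
            history.append("augment")
            continue
        return sva, cov, history
    return sva, coverage_fn(sva), history

# Toy demo: first pass fails compilation, repair fixes it, proof and
# coverage then succeed on the second pass.
sva, cov, hist = run_refinement(
    "assert_p",
    compile_fn=lambda s: ("fixed" in s, "syntax error"),
    prove_fn=lambda s: (True, None),
    coverage_fn=lambda s: 0.99,
    repair_fn=lambda s, diag: s + " fixed",
    correct_fn=lambda s, cex: s,
    augment_fn=lambda s, c: s,
)
print(hist)   # ['repair']
```

The KG's role in the paper is to supply the context each callback would need, e.g. trace links from a CEX back to the requirement a property encodes.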
major comments (3)
- [Evaluation] Evaluation section: the reported coverage range (78.5–99.4%) and low syntax-repair overhead are presented without any ablation that disables KG queries while retaining the same multi-agent loops, IR extraction, and refinement mechanisms (or substitutes flat-text concatenation). This prevents attribution of gains specifically to the KG structure rather than iterative feedback or prompting.
- [Evaluation] The manuscript acknowledges design dependence and difficulty with complex temporal/arithmetic reasoning but provides no quantitative breakdown of per-design variance, failure modes, or statistical details (e.g., run-to-run variance, number of trials, or confidence intervals) for the coverage metrics.
- [Approach and Evaluation] The central claim that KG-based retrieval 'improves specification-to-RTL grounding' rests on its weakest assumption: that IRs can be reliably structured into a KG supplying sufficient traceable context. No evidence is given that the KG construction itself is robust or that alternative context mechanisms were tested.
minor comments (2)
- [Abstract and Evaluation] The abstract and introduction use 'design dependence' without defining the specific designs or metrics used to quantify it; a table or subsection clarifying per-benchmark convergence behavior would improve clarity.
- [Approach] Notation for the KG nodes/edges (requirements, signals, properties) is introduced informally; an early figure or table defining the schema would aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important opportunities to strengthen the evaluation and clarify the contribution of the knowledge graph. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the reported coverage range (78.5–99.4%) and low syntax-repair overhead are presented without any ablation that disables KG queries while retaining the same multi-agent loops, IR extraction, and refinement mechanisms (or substitutes flat-text concatenation). This prevents attribution of gains specifically to the KG structure rather than iterative feedback or prompting.
Authors: We agree that an explicit ablation isolating the KG's role is necessary to attribute performance gains specifically to the structured retrieval mechanism. In the revised manuscript we will add a new ablation experiment that retains the identical multi-agent workflow, IR extraction pipeline, and three refinement loops but replaces KG queries with flat-text concatenation of the same extracted IRs. We will report the resulting formal coverage and syntax-repair overhead for direct comparison with the KG-based results. This addition will be placed in the Evaluation section alongside the existing benchmark tables. revision: yes
Referee: [Evaluation] The manuscript acknowledges design dependence and difficulty with complex temporal/arithmetic reasoning but provides no quantitative breakdown of per-design variance, failure modes, or statistical details (e.g., run-to-run variance, number of trials, or confidence intervals) for the coverage metrics.
Authors: We acknowledge the value of more granular statistical reporting. The current results reflect single-run executions per design, driven by the high cost of repeated LLM calls and formal-tool invocations. In revision we will expand the Evaluation section with a per-design table that includes individual coverage percentages, a categorized list of observed failure modes (e.g., incomplete temporal property coverage on arithmetic-heavy modules), and any run-to-run variance observed in preliminary multi-trial tests on two of the smaller benchmarks. We will also add a brief discussion of confidence-interval estimation and note the practical constraints on exhaustive statistical sampling. revision: partial
Referee: [Approach and Evaluation] The central claim that KG-based retrieval 'improves specification-to-RTL grounding' rests on the weakest assumption that IRs can be reliably structured into a KG supplying sufficient traceable context; no evidence is given that the KG construction itself is robust or that alternative context mechanisms were tested.
Authors: We accept that the manuscript would benefit from greater transparency on KG construction robustness and from explicit comparison language. Section 3 already details the IR extraction rules and entity-linking procedure; we will augment this with a new subsection on potential failure modes (LLM parsing inaccuracies, incomplete signal extraction) and the mitigation strategies employed by the update loops. While we did not evaluate alternative context stores (e.g., vector databases), the traceable links are essential for the CEX-guided and coverage-directed refinement loops described in Section 4. We will revise the abstract and introduction to frame the improvement as demonstrated through the integrated workflow rather than as a standalone claim about KG superiority, and we will add a limitations paragraph on the scope of the current evidence. revision: yes
Circularity Check
Empirical engineering pipeline with no derivation chain or self-referential reductions
Full rationale
The paper presents a multi-agent workflow that builds a KG from IRs extracted from natural-language specs, RTL, and tool feedback (syntax diagnostics, CEXs, coverage reports), then uses it to generate and refine SVAs via three loops. Evaluation on seven benchmarks reports coverage 78.5–99.4% and low syntax-repair overhead. No equations, fitted parameters, or first-principles predictions appear; claims rest directly on experimental outcomes. No self-citations are used as load-bearing uniqueness theorems or to smuggle ansatzes. The work is self-contained and falsifiable via replication on the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Structured IRs can be extracted from specifications, RTL, and formal-tool feedback to form a useful verification-centric KG
- domain assumption LLM agents can effectively query and update the KG to generate and refine SVAs including complex temporal reasoning
invented entities (1)
- Verification-centric Knowledge Graph (no independent evidence)
Forward citations
Cited by 2 Pith papers
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges — A survey of LLM applications in secure hardware design covering EDA synthesis, vulnerability analysis, countermeasures, and educational uses.
- LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges — LLMs enable RTL code generation and vulnerability analysis in hardware design but introduce data contamination and adversarial risks that require red-teaming and dynamic benchmarking.
Reference graph
Works this paper leans on
- [1] R. Kande et al., "(Security) Assertions by Large Language Models," IEEE Transactions on Information Forensics and Security, 2024.
- [2] Z. Yan et al., "AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs," in Proceedings of the 30th ASPDAC, ACM, 2025.
- [3] B. Mali et al., "ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation," in IEEE ISVLSI, 2024.
- [4] A. Kumar et al., "Saarthi: The First AI Formal Verification Engineer," in DVCon US, 2025.
- [5] D. N. Gadde et al., "Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification," in 38th SBCCI, 2025.
- [6] Y. Bai et al., "AssertionForge: Enhancing Formal Verification Assertion Generation with Structured Representation of Specifications and RTL," in IEEE ICLAD, 2025.
- [7] F. Wu et al., "Spec2Assertion: Automatic Pre-RTL Assertion Generation using LLMs with Progressive Regularization," arXiv, 2025.
- [8] A. Ayalasomayajula et al., "LASP: LLM Assisted Security Property Generation for SoC Verification," in ACM/IEEE 6th MLCAD, 2024.
- [9] H. Lyu et al., "AssertMiner: Module-Level Spec Generation and Assertion Mining using Static Analysis Guided LLMs," arXiv, 2025.
- [10] H. A. Quddus et al., "Enhanced VLSI Assertion Generation: Conforming to High-Level Specifications and Reducing LLM Hallucinations with RAG," in DVCon Europe, 2024.
- [11] S. Paul et al., "LISA: LLM Informed Systemverilog Assertion generation with RAG and Chain-of-Thought," in IEEE ISVLSI, 2025.
- [12] V. Pulavarthi et al., "Are LLMs Ready for Practical Adoption for Assertion Generation?" in DATE Conference, 2025.
- [13] K. Maddala et al., "LAAG-RV: LLM Assisted Assertion Generation for RTL Design Verification," in IEEE 8th ITC India, 2024.
- [14] OpenCores, https://opencores.org
- [15] N. Pinckney et al., "CVDP: A Next-Generation Benchmark Dataset for Evaluating LLMs and Agents on RTL Design and Verification," 2025.