
arxiv: 2507.15664 · v3 · submitted 2025-07-21 · 💻 cs.AR

VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair

Pith reviewed 2026-05-19 04:04 UTC · model grok-4.3

classification 💻 cs.AR
keywords Retrieval-Augmented Generation · Design for Testability · RTL repair · Large language models · Verilog · Electronic design automation · Automated debugging

The pith

Retrieval of similar verified designs lets LLMs fix RTL testability issues automatically with 7.72 times the success rate of direct prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VeriRAG as a way to automate the correction of Design for Testability (DFT) violations in RTL hardware descriptions. It builds VeriDFT, a dataset of Verilog designs in which each design is paired with a validated fix, then uses an autoencoder to find the most similar past cases and feeds them to an LLM as references. The LLM revises the new code over repeated steps that enforce the test rules while keeping the design synthesizable. If this works reliably, it would reduce the manual expert effort currently needed to make hardware designs testable. Ablation checks confirm that both the retrieval step and the iteration loop contribute to the gains.

Core claim

VeriRAG retrieves structurally similar RTL designs from VeriDFT using an autoencoder similarity model, pairs each with a rigorously validated correction, and supplies them as references to an LLM inside an iterative revision pipeline that enforces DFT compliance while preserving synthesizability, producing a 7.72-fold increase in successful automated repair rate over zero-shot baselines.

What carries the argument

An autoencoder-based similarity model that locates reference RTL designs from VeriDFT, each carrying a validated DFT correction, which then guide an iterative LLM revision process.
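The retrieval step can be pictured with a minimal sketch. It assumes that each design has already been encoded into a fixed-length vector and that similarity is cosine similarity between those vectors; the paper's actual encoder and metric may differ. The function names and the toy vectors are hypothetical stand-ins, not values from VeriDFT.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_references(query: Sequence[float],
                        corpus: Dict[str, Sequence[float]],
                        k: int = 2) -> List[Tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for autoencoder latent codes (made-up values).
corpus = {
    "counter_fixed": [1.0, 0.0, 0.0],
    "fifo_fixed": [0.0, 1.0, 0.0],
    "shift_reg_fixed": [0.9, 0.1, 0.0],
}
top = retrieve_references([1.0, 0.0, 0.0], corpus, k=2)
print(top)
```

In this toy run the query matches `counter_fixed` exactly and `shift_reg_fixed` closely, so those two would be handed to the LLM together with their validated corrections.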

If this is right

  • Fully automated DFT correction becomes possible for new RTL designs.
  • Successful repair rates rise by a factor of 7.72 relative to direct LLM prompting.
  • The iterative revision loop and the retrieval component each measurably improve outcomes.
  • Revised code remains synthesizable after DFT fixes.
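A rough sketch of an iterative revision loop of this kind follows. It assumes the LLM call, the DFT rule checker, and the synthesis check are available as plain callables; `revise`, `check_dft`, and `check_synth` are hypothetical names, and the toy string checks below only stand in for real tools.

```python
from typing import Callable, List, Optional, Tuple

Checker = Callable[[str], List[str]]                  # returns problem messages
Reviser = Callable[[str, List[str], List[str]], str]  # (code, references, problems) -> new code


def iterative_repair(rtl: str, references: List[str], revise: Reviser,
                     check_dft: Checker, check_synth: Checker,
                     max_rounds: int = 3) -> Tuple[Optional[str], int]:
    """Revise until both checkers are clean or the round budget is spent.

    Returns (repaired_code, revisions_used), or (None, max_rounds) on failure.
    """
    candidate = rtl
    for used in range(max_rounds + 1):
        problems = check_dft(candidate) + check_synth(candidate)
        if not problems:
            return candidate, used
        if used < max_rounds:
            candidate = revise(candidate, references, problems)
    return None, max_rounds


# Toy stand-ins (assumptions): a string match plays the DFT checker,
# a trailing-semicolon test plays synthesis, a text replace plays the LLM.
def toy_dft(code: str) -> List[str]:
    return ["clock is gated"] if "gated_clk" in code else []


def toy_synth(code: str) -> List[str]:
    return [] if code.strip().endswith(";") else ["syntax error"]


def toy_revise(code: str, references: List[str], problems: List[str]) -> str:
    return code.replace("gated_clk", "clk")


fixed, used = iterative_repair("always @(posedge gated_clk) q <= d;",
                               [], toy_revise, toy_dft, toy_synth)
print(fixed, used)
```

Running both checkers on every round is what keeps a DFT fix from silently breaking synthesizability: a candidate is accepted only when neither checker reports a problem.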

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same retrieval-plus-iteration pattern could apply to other hardware verification tasks such as timing or power fixes.
  • Larger, more diverse VeriDFT-style collections would likely raise coverage for uncommon RTL styles.
  • EDA toolchains could embed this style of assistance to lower the designer time spent on testability.

Load-bearing premise

Reference designs found by similarity search supply corrections that transfer directly to the new input design without introducing functional errors or synthesis failures.

What would settle it

Run VeriRAG on a fresh collection of RTL designs with documented DFT violations that were never seen during dataset construction, then check whether every output passes standard DFT checks and synthesizes without error.
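Such a check could be organized along the following lines. This is only a sketch: `repair`, `passes_dft`, and `synthesizes` are hypothetical callables standing in for the VeriRAG pipeline, a DFT rule checker, and a synthesis tool, and the three toy designs are invented for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence


def held_out_report(designs: Sequence[str],
                    repair: Callable[[str], Optional[str]],
                    passes_dft: Callable[[str], bool],
                    synthesizes: Callable[[str], bool]) -> Dict[str, object]:
    """Repair every held-out design and record which ones fail either check."""
    failures: List[str] = []
    for rtl in designs:
        fixed = repair(rtl)
        ok = fixed is not None and passes_dft(fixed) and synthesizes(fixed)
        if not ok:
            failures.append(rtl)
    total = len(designs)
    return {"total": total, "passed": total - len(failures),
            "all_pass": total > 0 and not failures, "failures": failures}


# Toy stand-ins (assumptions): the "repair" only knows how to fix gated clocks.
designs = ["a gated_clk;", "b gated_clk;", "c async_rst;"]
report = held_out_report(
    designs,
    repair=lambda code: code.replace("gated_clk", "clk"),
    passes_dft=lambda code: "gated_clk" not in code and "async_rst" not in code,
    synthesizes=lambda code: code.endswith(";"),
)
print(report)
```

The `failures` list matters as much as the pass count: the designs that fail show which violation types the retrieved references do not cover.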

Figures

Figures reproduced from arXiv: 2507.15664 by Haomin Qi, Kexin Chen, Lihao Zhang, Soung Chang Liew, Yining Du, Yuyang Du.

Figure 1. Statistic overview of the VeriDFT dataset: (a) proportion of …
Figure 2. An illustration of the Verilog-to-JSON transformation process: (a) shows the circuit diagram and (b) presents the corresponding RTL implementation; (c) detailed netlist with low …
Figure 3. The VeriRAG framework – (a) training of autoencoder network, (b) RAG-based code revision pipeline in the testing process.
Figure 4. Success rates of preliminary DFT error corrections …
Figure 5. Ultimate code revision success rates (with logical …
Original abstract

Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT) remains a relatively underexplored area. This paper presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a Retrieval-Augmented Generation (RAG) approach to enable LLM to revise code to ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity measurement model for precise retrieval of reference RTL designs for the LLM, and (2) an iterative code revision pipeline that allows the LLM to ensure DFT compliance while maintaining synthesizability. To support VeriRAG, we introduce VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG retrieves structurally similar RTL designs from VeriDFT, each paired with a rigorously validated correction, as references for code repair. With VeriRAG and VeriDFT, we achieve fully automated DFT correction -- resulting in a 7.72-fold improvement in successful repair rate compared to the zero-shot baseline (Fig. 5 in Section V). Ablation studies further confirm the contribution of each component of the VeriRAG framework. We open-source our data, models, and scripts at https://github.com/HarminChee/VeriRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VeriRAG, the first LLM-assisted DFT-EDA framework that combines an autoencoder-based similarity model for retrieving structurally similar RTL designs from the newly curated VeriDFT dataset with an iterative code revision pipeline to achieve automated DFT compliance while preserving synthesizability. The central empirical claim is a 7.72-fold improvement in successful repair rate over the zero-shot baseline, supported by ablation studies and illustrated in Fig. 5 of Section V.

Significance. If the reported gains prove robust, this work would represent a meaningful advance in applying retrieval-augmented techniques to underexplored DFT tasks within EDA, potentially reducing manual effort in RTL testability repair. The open-sourcing of the VeriDFT dataset, trained models, and scripts is a clear strength that aids reproducibility and future extensions. The empirical focus on held-out designs and synthesis pass rates provides a practical evaluation lens.

major comments (2)
  1. [Section V] Section V, Fig. 5 and associated text: the 7.72-fold improvement in successful repair rate is the load-bearing quantitative result, yet the manuscript does not provide the precise operational definition of 'successful repair' (e.g., synthesis success alone versus additional functional equivalence or formal verification checks) nor report variance or error bars across multiple LLM runs or random seeds; this directly affects the interpretability and reliability of the cross-method comparison.
  2. [Section IV] Section IV (methodology on autoencoder similarity): the retrieval step relies on a learned similarity threshold whose sensitivity is only partially addressed in the ablations; because the central claim depends on the quality of retrieved reference corrections transferring without introducing new functional or synthesis errors, a more systematic sensitivity analysis or threshold-robustness experiment would strengthen the result.
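The error bars requested in major comment 1 could be computed as in the sketch below. It assumes each run yields one pass/fail outcome per design; the outcome lists are invented for illustration and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def repair_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of designs repaired successfully in one run."""
    return sum(outcomes) / len(outcomes)


def summarize_runs(runs: Sequence[Sequence[bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-run repair rates."""
    rates = [repair_rate(run) for run in runs]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Hypothetical outcomes for three repeated runs over the same four designs.
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
avg, sd = summarize_runs(runs)
print(f"repair rate = {avg:.2f} +/- {sd:.2f}")
```

With only a few runs the sample standard deviation is a coarse estimate, so it is best read as a rough indicator of run-to-run spread rather than a tight confidence bound.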
minor comments (2)
  1. Ensure that all ablation tables explicitly list the exact metrics (success rate, synthesis pass rate) and the number of designs evaluated so that readers can directly compare component contributions.
  2. The notation for the autoencoder embedding space and similarity metric could be formalized with a short equation or pseudocode to improve clarity for readers outside the immediate subfield.
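One way the formalization requested in minor comment 2 might look is given here. It assumes an encoder $f_\theta$ that maps a design to a latent vector, cosine similarity between latent vectors, and a retrieval threshold $\tau$ over the dataset $\mathcal{D}$; these symbols are illustrative and are not taken from the paper, whose metric may differ.

```latex
z_x = f_\theta(x), \qquad
s(x, y) = \frac{z_x^\top z_y}{\lVert z_x \rVert_2 \, \lVert z_y \rVert_2}, \qquad
\mathcal{R}(x) = \{\, y \in \mathcal{D} : s(x, y) \ge \tau \,\}
```

Here $\mathcal{R}(x)$ is the set of reference designs retrieved for an input design $x$.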

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the presentation and analysis.

  1. Referee: [Section V] Section V, Fig. 5 and associated text: the 7.72-fold improvement in successful repair rate is the load-bearing quantitative result, yet the manuscript does not provide the precise operational definition of 'successful repair' (e.g., synthesis success alone versus additional functional equivalence or formal verification checks) nor report variance or error bars across multiple LLM runs or random seeds; this directly affects the interpretability and reliability of the cross-method comparison.

    Authors: We agree that an explicit operational definition is required for interpretability. In our evaluation protocol, a successful repair is defined as the output Verilog RTL passing synthesis without errors, satisfying DFT compliance (scan insertion and testability rule checks), and preserving functional equivalence to the original design as confirmed via simulation. We will add this precise definition to the text accompanying Fig. 5 in the revised Section V. Regarding variance, experiments used fixed deterministic settings for the LLM and a single run per held-out design to control computational cost. We will include a discussion of LLM stochasticity and report repair rates with standard deviation from three repeated runs on a representative subset of designs to provide error bars. revision: yes

  2. Referee: [Section IV] Section IV (methodology on autoencoder similarity): the retrieval step relies on a learned similarity threshold whose sensitivity is only partially addressed in the ablations; because the central claim depends on the quality of retrieved reference corrections transferring without introducing new functional or synthesis errors, a more systematic sensitivity analysis or threshold-robustness experiment would strengthen the result.

    Authors: We acknowledge that the current ablations focus primarily on the number of retrieved references rather than a full sweep of the similarity threshold. We will add a dedicated threshold-robustness experiment in the revised Section IV, varying the autoencoder similarity threshold across a range of values and reporting the resulting successful repair rates, synthesis pass rates, and any introduced functional or synthesis errors. This will directly demonstrate the stability of the retrieval component with respect to the central claim. revision: yes
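A threshold-robustness experiment of the kind promised in response 2 might be structured as follows. The sketch assumes that a design uses retrieved references only when its best similarity score reaches the threshold and otherwise falls back to zero-shot prompting; that fallback rule, the function name, and the toy data are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

# (best similarity score, repaired with references?, repaired zero-shot?)
Case = Tuple[float, bool, bool]


def sweep_threshold(cases: Sequence[Case],
                    thresholds: Sequence[float]) -> Dict[float, float]:
    """Repair rate at each threshold under the assumed fallback rule."""
    results: Dict[float, float] = {}
    for tau in thresholds:
        successes = 0
        for similarity, ok_with_refs, ok_zero_shot in cases:
            # Use the retrieved references only if they are similar enough.
            successes += ok_with_refs if similarity >= tau else ok_zero_shot
        results[tau] = successes / len(cases)
    return results


# Invented per-design outcomes, not measurements from the paper.
cases = [
    (0.95, True, False),
    (0.80, True, False),
    (0.60, False, True),
    (0.40, False, False),
]
rates = sweep_threshold(cases, [0.5, 0.7, 0.9])
print(rates)
```

In this toy data the middle threshold does best: a lower cutoff admits a weak reference that misleads the repair, while a higher one discards a useful reference.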

Circularity Check

0 steps flagged

No significant circularity in empirical framework


The paper describes an engineering system (VeriRAG + VeriDFT dataset) whose central claim is an empirical repair-rate improvement measured on held-out RTL designs. No derivation chain, equations, or uniqueness theorems are invoked that reduce to fitted parameters or self-referential definitions. The autoencoder similarity model and iterative revision pipeline are described as trained and validated components whose outputs are externally checked for synthesizability; the 7.72-fold gain is reported as a measured experimental result rather than a constructed identity. Self-citations, if present, are not load-bearing for the performance claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on standard assumptions about LLM behavior and synthesizability checks rather than new physical or mathematical axioms. The main added elements are the curated dataset and the retrieval model, both of which are empirical artifacts rather than free parameters fitted inside the central claim.

free parameters (1)
  • autoencoder similarity threshold
    The cutoff used to decide which reference designs are retrieved is chosen to balance relevance and coverage; its exact value is not stated in the abstract.
axioms (1)
  • domain assumption: LLM-generated RTL remains functionally equivalent after DFT insertion when guided by retrieved examples
    Invoked when claiming that the iterative revision maintains correctness while adding testability features.
invented entities (1)
  • VeriDFT dataset (no independent evidence)
    purpose: Provide paired faulty and corrected RTL examples for retrieval
    New curated collection of Verilog designs with validated DFT fixes; independent evidence would be external validation on industry designs.

pith-pipeline@v0.9.0 · 5795 in / 1418 out tokens · 29848 ms · 2026-05-19T04:04:27.183372+00:00 · methodology


