VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair
Pith reviewed 2026-05-19 04:04 UTC · model grok-4.3
The pith
Retrieval of similar verified designs lets LLMs fix RTL testability issues automatically with 7.72 times the success rate of direct prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriRAG retrieves structurally similar RTL designs from VeriDFT using an autoencoder similarity model, pairs each with a rigorously validated correction, and supplies them as references to an LLM inside an iterative revision pipeline that enforces DFT compliance while preserving synthesizability, producing a 7.72-fold increase in successful automated repair rate over zero-shot baselines.
What carries the argument
An autoencoder-based similarity model that locates reference RTL designs from VeriDFT, each carrying a validated DFT correction, which then guide an iterative LLM revision process.
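The retrieval step described above can be illustrated with a minimal sketch. It assumes each design has already been mapped to a fixed-length embedding vector and ranks library entries by cosine similarity. The function names, the `top_k` parameter, and the toy vectors are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_a * norm_b)


def retrieve_references(
    query: List[float],
    library: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k library entries ranked by similarity to the query."""
    scored = [(name, cosine_similarity(query, emb)) for name, emb in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for autoencoder outputs (hypothetical values).
toy_library = {
    "counter_fixed": [1.0, 0.0, 0.0],
    "fifo_fixed": [0.0, 1.0, 0.0],
    "alu_fixed": [0.7, 0.7, 0.0],
}
top = retrieve_references([0.9, 0.1, 0.0], toy_library, top_k=2)
print(top)
```

In a real pipeline the embeddings would come from the trained similarity model, and each retrieved name would map to a stored design together with its validated correction.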
If this is right
- Fully automated DFT correction becomes possible for new RTL designs.
- Successful repair rates rise by a factor of 7.72 relative to direct LLM prompting.
- The iterative revision loop and the retrieval component each measurably improve outcomes.
- Revised code remains synthesizable after DFT fixes.
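The iterative revision loop referred to in the list above might look roughly like the following sketch. The `revise` callback stands in for an LLM call and `check` stands in for DFT and synthesis tooling; both interfaces, the `max_rounds` limit, and the toy rule in the stubs are assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List, Optional


def iterative_repair(
    rtl: str,
    references: List[str],
    revise: Callable[[str, List[str], List[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Revise `rtl` until `check` reports no problems or rounds run out.

    `check` returns a list of problem descriptions (empty means clean).
    `revise` receives the code, the reference designs, and the problems.
    Returns the clean code, or None if the budget is exhausted.
    """
    problems = check(rtl)
    for _ in range(max_rounds):
        if not problems:
            return rtl
        rtl = revise(rtl, references, problems)
        problems = check(rtl)
    return rtl if not problems else None


# Stub callbacks so the loop can be exercised without an LLM or EDA tool.
def stub_check(code: str) -> List[str]:
    # Toy rule for this sketch: flag any use of a signal named gated_clk.
    return ["clock not directly controllable"] if "gated_clk" in code else []


def stub_revise(code: str, refs: List[str], problems: List[str]) -> str:
    return code.replace("gated_clk", "clk")


source = "always @(posedge gated_clk) q <= d;"
fixed = iterative_repair(source, [], stub_revise, stub_check)
unfixed = iterative_repair(source, [], lambda c, r, p: c, stub_check)
print(fixed, unfixed)
```

Returning `None` on an exhausted budget keeps failed repairs distinguishable from successful ones, which matters when a success rate is computed afterwards.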
Where Pith is reading between the lines
- The same retrieval-plus-iteration pattern could apply to other hardware verification tasks such as timing or power fixes.
- Larger, more diverse VeriDFT-style collections would likely raise coverage for uncommon RTL styles.
- EDA toolchains could embed this style of assistance to lower the designer time spent on testability.
Load-bearing premise
Reference designs found by similarity search supply corrections that transfer directly to the new input design without introducing functional errors or synthesis failures.
What would settle it
Run VeriRAG on a fresh collection of RTL designs with documented DFT violations that were never seen during dataset construction, then check whether every output passes standard DFT checks and synthesizes without error.
Original abstract
Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT) remains a relatively underexplored area. This paper presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a Retrieval-Augmented Generation (RAG) approach to enable an LLM to revise code to ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity measurement model for precise retrieval of reference RTL designs for the LLM, and (2) an iterative code revision pipeline that allows the LLM to ensure DFT compliance while maintaining synthesizability. To support VeriRAG, we introduce VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG retrieves structurally similar RTL designs from VeriDFT, each paired with a rigorously validated correction, as references for code repair. With VeriRAG and VeriDFT, we achieve fully automated DFT correction, resulting in a 7.72-fold improvement in successful repair rate compared to the zero-shot baseline (Fig. 5 in Section V). Ablation studies further confirm the contribution of each component of the VeriRAG framework. We open-source our data, models, and scripts at https://github.com/HarminChee/VeriRAG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeriRAG, the first LLM-assisted DFT-EDA framework that combines an autoencoder-based similarity model for retrieving structurally similar RTL designs from the newly curated VeriDFT dataset with an iterative code revision pipeline to achieve automated DFT compliance while preserving synthesizability. The central empirical claim is a 7.72-fold improvement in successful repair rate over the zero-shot baseline, supported by ablation studies and illustrated in Fig. 5 of Section V.
Significance. If the reported gains prove robust, this work would represent a meaningful advance in applying retrieval-augmented techniques to underexplored DFT tasks within EDA, potentially reducing manual effort in RTL testability repair. The open-sourcing of the VeriDFT dataset, trained models, and scripts is a clear strength that aids reproducibility and future extensions. The empirical focus on held-out designs and synthesis pass rates provides a practical evaluation lens.
major comments (2)
- [Section V] Section V, Fig. 5 and associated text: the 7.72-fold improvement in successful repair rate is the load-bearing quantitative result, yet the manuscript does not provide the precise operational definition of 'successful repair' (e.g., synthesis success alone versus additional functional equivalence or formal verification checks) nor report variance or error bars across multiple LLM runs or random seeds; this directly affects the interpretability and reliability of the cross-method comparison.
- [Section IV] Section IV (methodology on autoencoder similarity): the retrieval step relies on a learned similarity threshold whose sensitivity is only partially addressed in the ablations; because the central claim depends on the quality of retrieved reference corrections transferring without introducing new functional or synthesis errors, a more systematic sensitivity analysis or threshold-robustness experiment would strengthen the result.
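The threshold-robustness experiment requested in the second major comment could start from a sweep like the one sketched below. It only counts how many retrieved references survive each similarity threshold per query design; the data layout, function names, and similarity values are hypothetical placeholders, not measurements from the paper.

```python
from typing import Dict, List


def retained_counts(
    similarities: Dict[str, List[float]],
    thresholds: List[float],
) -> Dict[float, Dict[str, int]]:
    """For each threshold, count the references kept for each query design."""
    table: Dict[float, Dict[str, int]] = {}
    for t in thresholds:
        table[t] = {
            design: sum(1 for s in scores if s >= t)
            for design, scores in similarities.items()
        }
    return table


def uncovered_designs(counts: Dict[str, int]) -> List[str]:
    """Designs left with no reference at a given threshold."""
    return sorted(d for d, n in counts.items() if n == 0)


# Placeholder similarity scores (illustrative only, not from the paper).
toy_similarities = {
    "design_a": [0.95, 0.80, 0.40],
    "design_b": [0.60, 0.55],
}
sweep = retained_counts(toy_similarities, [0.5, 0.7, 0.9])
for threshold, counts in sweep.items():
    print(threshold, counts, uncovered_designs(counts))
```

A full experiment would rerun the repair pipeline at each threshold and report the repair and synthesis pass rates, but the coverage count already shows where a stricter threshold leaves designs with no reference at all.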
minor comments (2)
- Ensure that all ablation tables explicitly list the exact metrics (success rate, synthesis pass rate) and the number of designs evaluated so that readers can directly compare component contributions.
- The notation for the autoencoder embedding space and similarity metric could be formalized with a short equation or pseudocode to improve clarity for readers outside the immediate subfield.
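One way to write the formalization requested in the second minor comment is given below. The excerpt quoted further down this page mentions a cosine similarity $s_{ij}$; the embedding symbol $z_i$, the threshold $\tau$, and the retrieval set $\mathcal{R}(i)$ are notation assumed here for illustration and may differ from the paper's.

```latex
% z_i: embedding of design i (assumed notation)
% s_{ij}: cosine similarity between designs i and j
\[
  s_{ij} \;=\; \frac{z_i^{\top} z_j}{\lVert z_i \rVert_2 \, \lVert z_j \rVert_2},
  \qquad -1 \le s_{ij} \le 1 .
\]
% Hypothetical retrieval rule: keep dataset entries whose similarity
% to the query design i meets a threshold \tau.
\[
  \mathcal{R}(i) \;=\; \{\, j \;:\; s_{ij} \ge \tau \,\}.
\]
```

Under this reading, the free parameter listed in the ledger below corresponds to $\tau$.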
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the presentation and analysis.
- Referee: [Section V] Section V, Fig. 5 and associated text: the 7.72-fold improvement in successful repair rate is the load-bearing quantitative result, yet the manuscript does not provide the precise operational definition of 'successful repair' (e.g., synthesis success alone versus additional functional equivalence or formal verification checks) nor report variance or error bars across multiple LLM runs or random seeds; this directly affects the interpretability and reliability of the cross-method comparison.
Authors: We agree that an explicit operational definition is required for interpretability. In our evaluation protocol, a successful repair is defined as the output Verilog RTL passing synthesis without errors, satisfying DFT compliance (scan insertion and testability rule checks), and preserving functional equivalence to the original design as confirmed via simulation. We will add this precise definition to the text accompanying Fig. 5 in the revised Section V. Regarding variance, experiments used fixed deterministic settings for the LLM and a single run per held-out design to control computational cost. We will include a discussion of LLM stochasticity and report repair rates with standard deviation from three repeated runs on a representative subset of designs to provide error bars.
Revision: yes
- Referee: [Section IV] Section IV (methodology on autoencoder similarity): the retrieval step relies on a learned similarity threshold whose sensitivity is only partially addressed in the ablations; because the central claim depends on the quality of retrieved reference corrections transferring without introducing new functional or synthesis errors, a more systematic sensitivity analysis or threshold-robustness experiment would strengthen the result.
Authors: We acknowledge that the current ablations focus primarily on the number of retrieved references rather than a full sweep of the similarity threshold. We will add a dedicated threshold-robustness experiment in the revised Section IV, varying the autoencoder similarity threshold across a range of values and reporting the resulting successful repair rates, synthesis pass rates, and any introduced functional or synthesis errors. This will directly demonstrate the stability of the retrieval component with respect to the central claim.
Revision: yes
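The success definition and the repeated-run reporting promised in the first response can be sketched as follows. The three boolean checks mirror the criteria named in the rebuttal (synthesis, DFT compliance, functional equivalence); the class name, field names, and toy outcomes are hypothetical and do not reproduce any result from the paper.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Tuple


@dataclass
class RepairOutcome:
    synthesizes: bool
    dft_compliant: bool
    functionally_equivalent: bool

    @property
    def successful(self) -> bool:
        # A repair counts only if all three checks pass.
        return self.synthesizes and self.dft_compliant and self.functionally_equivalent


def success_rate(outcomes: List[RepairOutcome]) -> float:
    """Fraction of outcomes that pass all three checks."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.successful for o in outcomes) / len(outcomes)


def summarize_runs(runs: List[List[RepairOutcome]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the per-run success rates."""
    rates = [success_rate(r) for r in runs]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Toy outcomes for three repeated runs (illustrative only).
ok = RepairOutcome(True, True, True)
bad = RepairOutcome(True, True, False)  # synthesizes but changes behavior
runs = [[ok, ok, bad, bad], [ok, bad, bad, bad], [ok, ok, ok, bad]]
avg, spread = summarize_runs(runs)
print(avg, spread)
```

Treating success as a conjunction means an output that synthesizes but changes behavior is scored as a failure, which is the stricter reading the referee asked the authors to state explicitly.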
Circularity Check
No significant circularity in empirical framework
The paper describes an engineering system (VeriRAG + VeriDFT dataset) whose central claim is an empirical repair-rate improvement measured on held-out RTL designs. No derivation chain, equations, or uniqueness theorems are invoked that reduce to fitted parameters or self-referential definitions. The autoencoder similarity model and iterative revision pipeline are described as trained and validated components whose outputs are externally checked for synthesizability; the 7.72-fold gain is reported as a measured experimental result rather than a constructed identity. Self-citations, if present, are not load-bearing for the performance claim.
Axiom & Free-Parameter Ledger
free parameters (1)
- autoencoder similarity threshold
axioms (1)
- domain assumption: LLM-generated RTL remains functionally equivalent after DFT insertion when guided by retrieved examples
invented entities (1)
- VeriDFT dataset: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "VeriRAG integrates (1) an autoencoder-based similarity measurement model for precise retrieval of reference RTL designs... (2) an iterative code revision pipeline... VeriDFT... 7.72-fold improvement"
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "multi-task autoencoder... contrastive term Lcontrast... cosine similarity sij"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.