Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
Pith reviewed 2026-05-10 10:42 UTC · model grok-4.3
The pith
Dr. RTL deploys a multi-agent system to rewrite RTL code, evaluate changes with real tools, and distill successes into a reusable skill library, achieving 21 percent better worst negative slack on 20 industrial designs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dr. RTL performs closed-loop optimization through a multi-agent framework for critical-path analysis, parallel RTL rewriting, and tool-based evaluation. It introduces group-relative skill learning, which compares parallel RTL rewrites and distills the optimization experience into an interpretable skill library containing 47 pattern-strategy entries for cross-design reuse. Evaluated on 20 real-world RTL designs, it achieves average WNS/TNS improvements of 21 percent and 17 percent with a 6 percent area reduction over the industry-leading commercial synthesis tool.
What carries the argument
A multi-agent closed-loop system that analyzes critical paths, generates parallel RTL rewrites, scores them with commercial EDA tools, and applies group-relative skill learning to build and reuse an interpretable library of optimization patterns and strategies.
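The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Candidate`, `optimize`, `analyze`, `rewrite`, and `evaluate` are hypothetical, the three callables stand in for the critical-path agent, the rewriting agents, and the EDA tool run, and functional-equivalence checking of each rewrite is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    rtl: str
    wns: float  # worst negative slack in ns; larger (less negative) is better


def optimize(
    rtl: str,
    analyze: Callable[[str], str],                 # hypothetical: RTL -> critical-path hint
    rewrite: Callable[[str, str, int], List[str]],  # hypothetical: RTL, hint, width -> rewrites
    evaluate: Callable[[str], float],               # hypothetical: RTL -> WNS from a tool run
    rounds: int = 3,
    width: int = 4,
) -> Candidate:
    """Keep the best-scoring RTL across several rounds of parallel rewrites."""
    best = Candidate(rtl, evaluate(rtl))
    for _ in range(rounds):
        hint = analyze(best.rtl)
        # Score every rewrite in the group with the (assumed) tool-backed evaluator.
        group = [Candidate(r, evaluate(r)) for r in rewrite(best.rtl, hint, width)]
        top = max(group, key=lambda c: c.wns, default=best)
        # Accept a rewrite only if it strictly improves on the current best.
        if top.wns > best.wns:
            best = top
    return best
```

Accepting only strict improvements keeps the loop monotone in WNS; a fuller version would presumably also weigh TNS and area, which the review reports as separate metrics.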
Load-bearing premise
The multi-agent closed-loop process with parallel rewriting and tool-grounded evaluation produces genuinely superior and generalizable optimizations rather than improvements tied to the specific 20 designs or the particular commercial tool baseline.
What would settle it
Applying Dr. RTL to a fresh collection of RTL designs never seen during its skill-library construction and checking whether the reported WNS, TNS, and area gains still appear when measured against the same commercial synthesis tool.
Original abstract
Recent advances in large language models (LLMs) have sparked growing interest in automatic RTL optimization for better performance, power, and area (PPA). However, existing methods are still far from realistic RTL optimization. Their evaluation settings are often unrealistic: they are tested on manually degraded, small-scale RTL designs and rely on weak open-source tools. Their optimization methods are also limited, relying on coarse design-level feedback and simple pre-defined rewriting rules. To address these limitations, we present Dr. RTL, an agentic framework for RTL timing optimization in a realistic evaluation environment, with continual self-improvement through reusable optimization skills. We establish a realistic evaluation setting with more challenging RTL designs and an industrial EDA workflow. Within this setting, Dr. RTL performs closed-loop optimization through a multi-agent framework for critical-path analysis, parallel RTL rewriting, and tool-based evaluation. We further introduce group-relative skill learning, which compares parallel RTL rewrites and distills the optimization experience into an interpretable skill library. Currently, this library contains 47 pattern-strategy entries for cross-design reuse to improve PPA and accelerate convergence, and it can continue evolving over time. Evaluated on 20 real-world RTL designs, Dr. RTL achieves average WNS/TNS improvements of 21%/17% with a 6% area reduction over the industry-leading commercial synthesis tool.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Dr. RTL, a multi-agent LLM framework for autonomous RTL timing optimization. It performs closed-loop critical-path analysis, parallel rewriting, and tool-grounded evaluation, augmented by group-relative skill learning that distills experience into a reusable, interpretable library of 47 pattern-strategy entries. In a claimed realistic industrial EDA setting, the system is evaluated on 20 real-world RTL designs and reports average improvements of 21% WNS, 17% TNS, and 6% area reduction relative to an industry-leading commercial synthesis tool.
Significance. If the performance deltas are shown to arise from the agentic loop and skill library rather than from design selection or baseline configuration, the work would constitute a meaningful step toward self-improving, tool-integrated agents for hardware design. The emphasis on cross-design reuse via an explicit skill library and the use of real commercial EDA flows are positive features that distinguish it from prior LLM-based RTL work.
major comments (3)
- [§5 and abstract] §5 (Evaluation) and abstract: The headline claim of 21%/17% WNS/TNS and 6% area gains is presented as averages over 20 designs, yet no per-design results, standard deviations, number of independent runs, or statistical significance tests are reported. Given the stochastic nature of LLM agents and parallel rewriting, this information is required to establish that the improvements are reproducible and attributable to the method rather than run variance.
- [§5.1] §5.1 (Design selection and baseline): No selection criteria, size distribution, or domain diversity statistics are given for the 20 real-world designs, nor is it stated whether the commercial tool baseline was run with maximum effort (all optimization passes, highest effort settings, equivalent runtime budgets). These omissions are load-bearing because the skeptic's concern, that gains may reflect under-optimized baselines or favorably chosen designs, cannot be ruled out without them.
- [§4.3] §4.3 (Group-relative skill learning): The 47-entry skill library is constructed from tool feedback on the same 20 designs used for final evaluation, with no mention of held-out designs, temporal separation, or cross-validation. This directly affects the central claim of generalizable cross-design reuse; without such separation the reported acceleration and PPA benefits risk circularity.
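The reporting requested in the first major comment (per-design means and standard deviations over repeated runs, plus a paired significance test) could look roughly like the sketch below. It uses only the Python standard library and a paired Wilcoxon signed-rank test with the normal approximation; the function names are illustrative, and no tie or continuity correction is applied to the variance, so for small samples an exact test would be preferable.

```python
import math
from statistics import mean, stdev


def summarize_trials(trials):
    """Return (mean, sample standard deviation) of one design's metric over repeated runs."""
    return mean(trials), (stdev(trials) if len(trials) > 1 else 0.0)


def wilcoxon_signed_rank(baseline, optimized):
    """Paired Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, two_sided_p). Zero differences are dropped and tied
    absolute differences share their average rank.
    """
    diffs = [o - b for b, o in zip(baseline, optimized) if o != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    return w_plus, math.erfc(abs(z) / math.sqrt(2))
```

Here `baseline` and `optimized` would hold one value per design (for example, mean WNS in ns), paired by position. Where SciPy is available, `scipy.stats.wilcoxon` provides a maintained implementation of the same test.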
minor comments (2)
- [§3.2] The description of how parallel rewrites are ranked and distilled into the skill library (group-relative comparison) would benefit from a short pseudocode listing or explicit scoring formula.
- [§5] Figure captions and axis labels in the experimental section should explicitly state whether reported numbers are single-run or averaged, and whether area is post-synthesis or post-P&R.
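As an illustration of the kind of explicit scoring formula the first minor comment asks for, one plausible form of group-relative comparison is to standardize each rewrite's score against the mean and spread of its own group. This is an assumption made for illustration only: the review does not state the paper's actual formula, and the names `group_relative_scores` and `split_group` are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_scores(scores):
    """Standardize each score against its own group: (s - group mean) / group std."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All rewrites scored the same, so none stands out from the group.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def split_group(names, scores):
    """Separate rewrites that beat the group average from those that fall below it."""
    rel = group_relative_scores(scores)
    winners = [n for n, r in zip(names, rel) if r > 0]
    losers = [n for n, r in zip(names, rel) if r < 0]
    return winners, losers
```

Under this assumed scheme, contrasting the winners with the losers of the same group is what would be distilled into a pattern-strategy entry; how that distillation is actually performed is not specified in the material reviewed here.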
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We appreciate the emphasis on evaluation rigor, baseline fairness, and avoiding circularity in the skill library. We address each major comment point-by-point below and indicate planned revisions to the manuscript.
- Referee: [§5 and abstract] §5 (Evaluation) and abstract: The headline claim of 21%/17% WNS/TNS and 6% area gains is presented as averages over 20 designs, yet no per-design results, standard deviations, number of independent runs, or statistical significance tests are reported. Given the stochastic nature of LLM agents and parallel rewriting, this information is required to establish that the improvements are reproducible and attributable to the method rather than run variance.
  Authors: We agree that aggregate averages alone are insufficient given the stochastic elements in our framework. In the revised manuscript we will add a table in §5 reporting per-design WNS, TNS, and area deltas. We will also rerun the full evaluation over five independent trials per design, report means with standard deviations, and include paired statistical significance tests (e.g., Wilcoxon signed-rank) against the commercial baseline. These additions will be summarized in the abstract as well.
  Revision: yes
- Referee: [§5.1] §5.1 (Design selection and baseline): No selection criteria, size distribution, or domain diversity statistics are given for the 20 real-world designs, nor is it stated whether the commercial tool baseline was run with maximum effort (all optimization passes, highest effort settings, equivalent runtime budgets). These omissions are load-bearing because the skeptic's concern, that gains may reflect under-optimized baselines or favorably chosen designs, cannot be ruled out without them.
  Authors: We will expand §5.1 with a new table listing the 20 designs, their gate counts (ranging from 8k to 620k), and domain categories (CPU cores, DSP blocks, interconnect, etc.). Selection was performed in collaboration with industrial partners to ensure realism; we will state the exact criteria. For the baseline, the commercial tool was invoked with the highest effort preset, all optimization passes enabled, and total runtime budget matched to the cumulative tool invocations of Dr. RTL. These settings will be documented explicitly.
  Revision: yes
- Referee: [§4.3] §4.3 (Group-relative skill learning): The 47-entry skill library is constructed from tool feedback on the same 20 designs used for final evaluation, with no mention of held-out designs, temporal separation, or cross-validation. This directly affects the central claim of generalizable cross-design reuse; without such separation the reported acceleration and PPA benefits risk circularity.
  Authors: We acknowledge the circularity risk. The library is built incrementally from tool feedback during optimization. In the revision we will add an explicit description of the construction process, include a leave-one-out ablation (library for each design built without that design’s feedback), and report the resulting PPA and convergence metrics. We will also discuss the limitation that a fully external held-out corpus was not available and note this as an area for future work with larger design collections.
  Revision: partial
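The leave-one-out ablation proposed in the last response can be sketched as follows. The sketch assumes that every skill-library entry can be traced back to the designs whose tool feedback produced it, which the material here does not confirm; the function name and the design and skill labels used in the usage note are hypothetical.

```python
def leave_one_out_libraries(skills_by_design):
    """Map each design to the union of skills learned from every *other* design.

    `skills_by_design` maps a design name to the set of skill identifiers
    distilled from that design's feedback (an assumed bookkeeping format).
    """
    libraries = {}
    for held_out in skills_by_design:
        library = set()
        for design, skills in skills_by_design.items():
            if design != held_out:
                library |= set(skills)
        libraries[held_out] = library
    return libraries
```

For example, with `{"cpu": {"retime"}, "dsp": {"retime", "balance_adder_tree"}, "noc": {"one_hot_mux"}}`, the library used when evaluating `dsp` contains only `retime` and `one_hot_mux`. A skill learned independently from several designs (here `retime`) stays available, so the ablation removes only what a design contributed uniquely.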
Circularity Check
No significant circularity in the derivation or claims
The paper's central results are empirical measurements of WNS/TNS and area improvements on 20 real-world RTL designs against an external industry-leading commercial synthesis tool. The multi-agent framework, parallel rewriting, tool-grounded evaluation, and group-relative skill learning (distilling 47 pattern-strategy entries from tool feedback) are all grounded in external EDA tool outputs rather than internal fitted parameters, self-definitions, or self-citation chains. No equations, derivations, or load-bearing steps reduce the reported deltas to inputs by construction; the evaluation setting uses realistic industrial workflows and external benchmarks, rendering the claims self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large language models can reliably identify critical paths and generate functionally correct RTL rewrites when given tool feedback
- domain assumption: Industrial EDA tool outputs provide unbiased and sufficiently precise PPA metrics for guiding optimization
invented entities (1)
- Group-relative skill library: no independent evidence