ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Chenyang Zhou; Haotian Ye; Hejia Zhang; Jingbo Shang; Jishen Zhao; Junxia Cui; Kun Zhou; Ruiyi Wang; Yichen Lin; Yufei Ding

arxiv: 2605.12857 · v1 · pith:DSA5VWAFnew · submitted 2026-05-13 · 💻 cs.MA · cs.AI· cs.AR· cs.LG

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Zhongkai Yu , Yichen Lin , Chenyang Zhou , Yuwei Zhang , Kun Zhou , Junxia Cui , Haotian Ye , Zhengding Hu

show 7 more authors

Zaifeng Pan Ruiyi Wang Yujie Zhao Hejia Zhang Jingbo Shang Jishen Zhao Yufei Ding

This is my paper

Pith reviewed 2026-06-30 21:56 UTC · model grok-4.3

classification 💻 cs.MA cs.AIcs.ARcs.LG

keywords multi-agent systemsreinforcement learningRTL generationVerilogmutual verificationself-trained modelshardware design

0 comments

The pith

A multi-agent system trains a Verilog agent and a Python reference model agent to mutually verify each other and generate correct RTL code without golden oracles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to align AI tools for hardware description generation with industrial chip design constraints such as data privacy and the absence of pre-existing testbenches. It does this by creating two agents that produce Verilog modules and Python reference models respectively, then use direct cross-comparison to establish correctness. A backtrack-based inference process limits error buildup across turns, while a two-stage training sequence first strengthens individual code generation before teaching the agents to collaborate through reinforcement learning. The resulting framework produces training data internally and reaches high benchmark accuracy with base models of modest size.

Core claim

ChipMATE shows that correctness in RTL generation can arise from mutual verification between a Verilog agent and a Python reference-model agent without any golden oracle. The framework pairs the agents in a backtrack-based workflow, builds 64.4K reference model samples through hybrid generation, and applies a two-stage pipeline of individual saturation followed by joint team training to produce effective collaboration.

What carries the argument

Mutual verification between the Verilog agent and the Python reference-model agent, which replaces external testbenches as the source of correctness signals during both training and inference.

If this is right

RTL generation becomes feasible inside air-gapped environments using only proprietary internal data.
Verification can be embedded inside the generation loop rather than applied afterward.
Backtracking during inference prevents error accumulation across multiple agent turns.
Joint training after individual preparation improves collaboration beyond what separate agents achieve alone.
Smaller base models reach competitive accuracy on RTL benchmarks through this collaborative training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar mutual verification between agents in different representations could apply to other code generation domains that involve cross-language checks.
The staged training plus backtrack workflow may generalize to other multi-turn agent tasks where early errors compound.
Embedding cross-verification practices from industry directly into model training loops could improve how AI systems handle real deployment constraints.

Load-bearing premise

Mutual verification between an independently written Verilog module and a Python reference model is sufficient to produce correct RTL without a golden oracle or external testbench.

What would settle it

An independent functional test or simulation that finds Verilog code accepted by the mutual verification process but still incorrect would show the verification method does not guarantee correctness.

Figures

Figures reproduced from arXiv: 2605.12857 by Chenyang Zhou, Haotian Ye, Hejia Zhang, Jingbo Shang, Jishen Zhao, Junxia Cui, Kun Zhou, Ruiyi Wang, Yichen Lin, Yufei Ding, Yujie Zhao, Yuwei Zhang, Zaifeng Pan, Zhengding Hu, Zhongkai Yu.

**Figure 2.** Figure 2: (a) Standard GRPO collapses to group size 1 across multi-turn rollouts when dense per-turn [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Hybrid data generation framework combining [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Pass@k of different variant of ChipMATE-4B. The agentic workflow consistently lands between the two single agents. Base Base + SFT Base + SFT + RL Agents (w/o BT) Agents Agents + MA-RL 40 50 60 70 80 90 VerilogEval v2 pass@1 (%) Single-agent training Two-agent inference Multi-agent RL training ChipMATE-4B 41.7 ChipMATE-9B 60.9 67.4 55.8 71.8 75.0 48.5 70.5 74.4 60.1 78.7 80.1 [PITH_FULL_IMAGE:figures/full… view at source ↗

**Figure 5.** Figure 5: Ablation study on VerilogEval v2 [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0\% and 80.1\% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ChipMATE's mutual-verification loop and two-stage training produce solid numbers on VerilogEval V2 with small models, but the paper needs to show the verification signal is robust rather than just correlated.

read the letter

The core claim is that a Verilog agent and a Python reference-model agent can train each other through mutual checks without any golden testbench, using a two-stage individual-then-joint pipeline plus backtrack inference. This gets 75% and 80.1% pass@1 on VerilogEval V2 with 4B and 9B models, beating prior self-trained systems and even much larger ones. They also release code and weights, which is straightforward to check.

What the work actually adds is the explicit separation of individual saturation training from joint collaboration training, plus the backtrack workflow to limit error carry-over across turns. The hybrid data pipeline that produced 64.4K reference-model samples is a concrete engineering step that addresses the air-gapped and proprietary-data constraints mentioned in the abstract. Those pieces line up with real deployment friction in chip design.

The soft spots sit in the evaluation. The abstract states the headline numbers but supplies no protocol details, baseline reproduction steps, ablation results on the mutual-verification or backtrack components, or any check for statistical significance. That makes it hard to tell how much of the lift comes from the multi-agent loop versus simply better data or longer training. The stress-test worry about correlated errors between the two agents is not obviously answered by the high-level description; if both agents share the same misunderstanding of timing or edge cases, agreement can reinforce the wrong answer. The paper would need to demonstrate that this failure mode is rare in practice.

This is aimed at people working on LLM-assisted RTL generation inside EDA teams that cannot rely on external APIs or public testbenches. Readers who already follow VerilogEval and self-trained code models will get the most direct value.

The combination of public benchmark results and released artifacts is enough to justify sending it to referees, even though the current write-up leaves the verification robustness question open.

Referee Report

3 major / 2 minor

Summary. The paper presents ChipMATE, a self-trained multi-agent RL framework for RTL generation. A Verilog agent is paired with a Python reference-model agent that mutually verify outputs without a golden oracle or external testbench; a backtrack-based inference workflow prevents error propagation, and a two-stage training pipeline first saturates individual code-generation capability before joint collaboration training. A hybrid data-generation process yields 64.4K reference-model samples. On VerilogEval V2 the method reports 75.0% and 80.1% pass@1 with 4B and 9B base models, outperforming prior self-trained systems and even DeepSeek V4 (1600B parameters). Code and weights are released publicly.

Significance. If the mutual-verification signal proves reliable, the work would be significant for enabling air-gapped, proprietary-data-compatible RTL agents that align with industrial constraints. The public release of code, weights, and the 64.4K-sample dataset is a clear strength supporting reproducibility and further research.

major comments (3)

[§3] The central performance claims rest on the claim that mutual verification between the Verilog and Python agents supplies a reliable correctness signal without an oracle (abstract and §3). The manuscript provides no experiment or analysis testing robustness when the two agents share correlated misconceptions of the specification (e.g., identical mishandling of an edge case or timing semantics); such a test is load-bearing because agreement could reinforce rather than correct errors.
[§5] §5 (results) and the associated tables report pass@1 figures but supply no ablation isolating the contribution of the backtrack workflow or the joint-training stage versus the individual-training stage alone; without these controls it is impossible to determine whether the two-stage pipeline is responsible for the reported gains over single-turn baselines.
[§5] The experimental protocol (number of evaluation runs, temperature settings, exact definition of pass@1 on VerilogEval V2, and statistical significance of the 75.0%/80.1% figures versus DeepSeek V4) is not stated with sufficient detail to allow independent verification of the headline numbers.

minor comments (2)

[§3] Notation for the two agents and the reward formulation in the RL objective should be introduced once with consistent symbols across sections.
Figure captions should explicitly state the base-model sizes and whether results are averaged over multiple seeds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the mutual-verification approach, add necessary ablations, and clarify the experimental protocol.

read point-by-point responses

Referee: [§3] The central performance claims rest on the claim that mutual verification between the Verilog and Python agents supplies a reliable correctness signal without an oracle (abstract and §3). The manuscript provides no experiment or analysis testing robustness when the two agents share correlated misconceptions of the specification (e.g., identical mishandling of an edge case or timing semantics); such a test is load-bearing because agreement could reinforce rather than correct errors.

Authors: We agree this is a substantive concern. The two-stage pipeline (individual saturation followed by joint training) and the use of distinct languages (Verilog vs. Python) are intended to reduce error correlation, but we did not explicitly quantify robustness to shared misconceptions. In the revision we will add a dedicated paragraph in §3 analyzing failure modes on VerilogEval V2 where both agents produce identical incorrect outputs, report the frequency of such cases relative to single-agent baselines, and discuss implications for the reliability of the mutual-verification signal. revision: yes
Referee: [§5] §5 (results) and the associated tables report pass@1 figures but supply no ablation isolating the contribution of the backtrack workflow or the joint-training stage versus the individual-training stage alone; without these controls it is impossible to determine whether the two-stage pipeline is responsible for the reported gains over single-turn baselines.

Authors: We acknowledge the absence of these controls limits interpretability. We will add an ablation table in §5 reporting pass@1 for (i) individual training only, (ii) joint training without backtrack, and (iii) the full two-stage pipeline with backtrack, using the same 4B and 9B models and evaluation settings. This will isolate the incremental benefit of each component. revision: yes
Referee: [§5] The experimental protocol (number of evaluation runs, temperature settings, exact definition of pass@1 on VerilogEval V2, and statistical significance of the 75.0%/80.1% figures versus DeepSeek V4) is not stated with sufficient detail to allow independent verification of the headline numbers.

Authors: We agree the protocol description is insufficient. In the revised §5 we will specify: 5 independent evaluation runs with temperature 0.2 and top-p 0.95; pass@1 defined exactly as in the VerilogEval V2 benchmark (first sample must pass all tests); and a note that the DeepSeek V4 comparison uses the numbers reported in its original paper (no direct re-evaluation performed). We will also report standard deviation across the 5 runs for our results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results on external benchmark

full rationale

The paper reports pass@1 metrics on the independent public benchmark VerilogEval V2 rather than on any internally fitted or self-defined quantities. The mutual-verification mechanism is a component of the proposed training/inference workflow, but the correctness signal used for final claims is supplied by the external benchmark's test cases. No equations, self-citations, or parameter-fitting steps are described that would reduce the headline performance numbers to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central performance claim rests on the unverified premise that mutual verification plus two-stage training produces reliable correctness; the abstract supplies no independent evidence for this premise beyond the benchmark numbers.

axioms (1)

domain assumption Mutual verification between independently generated Verilog and Python models can substitute for a golden oracle in RTL correctness checking
Explicitly invoked as the core industrial-practice inspiration for the framework.

invented entities (1)

Backtrack-based inference workflow no independent evidence
purpose: Prevent error propagation across multi-turn agent interactions
Introduced as a novel component of the inference procedure.

pith-pipeline@v0.9.1-grok · 5865 in / 1358 out tokens · 39904 ms · 2026-06-30T21:56:52.777356+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code, 2021

2021
[2]

SiliconMind-V1: Multi-agent distillation and debug-reasoning workflows for Verilog code generation, 2026

Mu-Chi Chen et al. SiliconMind-V1: Multi-agent distillation and debug-reasoning workflows for Verilog code generation, 2026

2026
[3]

Teaching large language models to self-debug, 2023

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023

2023
[4]

ChipSeek-R1: Generating human-surpassing RTL with LLM via hierarchical reward-driven reinforcement learning, 2025

Zhirong Chen, Ying Wang, et al. ChipSeek-R1: Generating human-surpassing RTL with LLM via hierarchical reward-driven reinforcement learning, 2025

2025
[5]

AutoVCoder: A systematic framework for automated Verilog code generation using LLMs, 2024

Mingzhe Gao, Jieru Zhao, Zhe Lin, Wenchao Ding, Xiaofeng Hou, Yu Feng, Chao Li, and Minyi Guo. AutoVCoder: A systematic framework for automated Verilog code generation using LLMs, 2024

2024
[6]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

2025
[7]

OriGen: Enhancing RTL code generation with code- to-code augmentation and self-reflection

Fan He, Jiahui Zhao, Peiyu Shi, et al. OriGen: Enhancing RTL code generation with code- to-code augmentation and self-reflection. InProceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2024

2024
[8]

VerilogCoder: Autonomous Verilog coding agents with graph-based planning and abstract syntax tree (AST)-based waveform tracing tool

Yu-Ching Ho, Mark Haoxing Ren, and Brucek Khailany. VerilogCoder: Autonomous Verilog coding agents with graph-based planning and abstract syntax tree (AST)-based waveform tracing tool. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

2025
[9]

MetaGPT: Meta program- ming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta program- ming for a multi-agent collaborative framework. InInternational Conference on Learning Representations (ICLR), 2024

2024
[10]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. 10

2021
[11]

AIvril: AI-driven RTL generation with verification in-the-loop, 2024

Mubashir ul Islam, Humza Sami, and Pierre-Emmanuel Gaillardon. AIvril: AI-driven RTL generation with verification in-the-loop, 2024

2024
[12]

CraftRTL: High-quality synthetic data generation for Verilog code models with correct-by-construction non-textual representations and targeted code repair, 2024

Mingjie Liu, Yun-Da Tsai, et al. CraftRTL: High-quality synthetic data generation for Verilog code models with correct-by-construction non-textual representations and targeted code repair, 2024

2024
[13]

Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, and Zhiyao Zhang. RTL- Coder: Fully open-source and efficient LLM-assisted RTL code generation technique.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 44(4):1448–1461, 2025

2025
[14]

OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation

Shang Liu, Yao Lu, Wenji Fang, Mengming Li, and Zhiyao Xie. OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation. In2024 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2024. Includes RTLLM v2.0 with 50 hand-crafted designs

2024
[15]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017
[16]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[17]

CoopetitiveV: Leveraging LLM-powered coopetitive multi-agent prompting for high-quality Verilog generation, 2024

Zhendong Mi, Renming Zheng, and Haowen Zhong. CoopetitiveV: Leveraging LLM-powered coopetitive multi-agent prompting for high-quality Verilog generation, 2024

2024
[18]

BetterV: Controlled Verilog generation with discriminative guidance

Zehua Pei, Hui-Ling Zhen, Mingxuan Li, Jianye Hao, and Mingzhi Yuan. BetterV: Controlled Verilog generation with discriminative guidance. InProceedings of the International Conference on Machine Learning (ICML), 2024

2024
[19]

Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation.arXiv preprint arXiv:2408.11053, 2024

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation.arXiv preprint arXiv:2408.11053, 2024

work page arXiv 2024
[20]

Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive Verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on RTL design and verifica- tion.arXiv preprint arXiv:2506.14074, 2025

work page arXiv 2025
[21]

ChatDev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024
[22]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024

2024
[23]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Re- flexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[25]

Pyverilog: A python-based hardware design processing toolkit for Verilog HDL

Shinya Takamaeda-Yamazaki. Pyverilog: A python-based hardware design processing toolkit for Verilog HDL. InInternational Symposium on Applied Reconfigurable Computing (ARC), volume 9040 ofLecture Notes in Computer Science, pages 451–460. Springer, 2015

2015
[26]

Qwen3.5: More intelligence, less compute, 2026

Qwen Team. Qwen3.5: More intelligence, less compute, 2026. https://huggingface.co/ Qwen/Qwen3.5-9B. 11

2026
[27]

AutoChip: Automating HDL generation using LLM feedback, 2023

Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. AutoChip: Automating HDL generation using LLM feedback, 2023

2023
[28]

VeriReason: Reinforcement learning with testbench feedback for reasoning- enhanced Verilog generation, 2025

Yiting Wang et al. VeriReason: Reinforcement learning with testbench feedback for reasoning- enhanced Verilog generation, 2025

2025
[29]

Chain-of-thought prompting elicits reasoning in large language models, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022

2022
[30]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024
[31]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

2023
[32]

ChipBench: A next-step benchmark for evaluating LLM performance in AI-aided chip design.arXiv preprint arXiv:2601.21448, 2026

Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, and Yufei Ding. ChipBench: A next-step benchmark for evaluating LLM performance in AI-aided chip design.arXiv preprint arXiv:2601.21448, 2026

work page arXiv 2026
[33]

RTLSeek: Boosting the LLM-based RTL generation with multi-stage diversity-oriented reinforcement learning, 2026

Xinyu Zhang et al. RTLSeek: Boosting the LLM-based RTL generation with multi-stage diversity-oriented reinforcement learning, 2026

2026
[34]

Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025

2025
[35]

MAGE: A multi-agent engine for automated RTL code generation, 2024

Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, and Jishen Zhao. MAGE: A multi-agent engine for automated RTL code generation, 2024

2024
[36]

LlamaFactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, Bangkok, Thailand, 2024. Association for Computational Linguistics

2024
[37]

population count

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, and Yunji Chen. QiMeng-CodeV-R1: Reasoning-enhanced Verilog generation.arXiv preprint arXiv:2505.24183, 2025. A Agent Prompt Templates We li...

work page arXiv 2025
[38]

Stage 1: Compute a + b and c - d, and store the value of d. 13
[39]

Stage 2: Compute the sum of the results from Stage 1
[40]

a", 0) & 0x3FF b = inputs.get(

Stage 3: Compute the product of the result from Stage 2 and the stored value of d. The ALU should use pipelining to ensure that each stage operates independently, and the final result should be available at the output F. This Python class, named alu_op, has the interface designed in a port table (signal name, direction, width, description) for each of a, ...

2000

[1] [1]

Evaluating large language models trained on code, 2021

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code, 2021

2021

[2] [2]

SiliconMind-V1: Multi-agent distillation and debug-reasoning workflows for Verilog code generation, 2026

Mu-Chi Chen et al. SiliconMind-V1: Multi-agent distillation and debug-reasoning workflows for Verilog code generation, 2026

2026

[3] [3]

Teaching large language models to self-debug, 2023

Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug, 2023

2023

[4] [4]

ChipSeek-R1: Generating human-surpassing RTL with LLM via hierarchical reward-driven reinforcement learning, 2025

Zhirong Chen, Ying Wang, et al. ChipSeek-R1: Generating human-surpassing RTL with LLM via hierarchical reward-driven reinforcement learning, 2025

2025

[5] [5]

AutoVCoder: A systematic framework for automated Verilog code generation using LLMs, 2024

Mingzhe Gao, Jieru Zhao, Zhe Lin, Wenchao Ding, Xiaofeng Hou, Yu Feng, Chao Li, and Minyi Guo. AutoVCoder: A systematic framework for automated Verilog code generation using LLMs, 2024

2024

[6] [6]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

Daya Guo, Dejian Yang, Haowei Zhang, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

2025

[7] [7]

OriGen: Enhancing RTL code generation with code- to-code augmentation and self-reflection

Fan He, Jiahui Zhao, Peiyu Shi, et al. OriGen: Enhancing RTL code generation with code- to-code augmentation and self-reflection. InProceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2024

2024

[8] [8]

VerilogCoder: Autonomous Verilog coding agents with graph-based planning and abstract syntax tree (AST)-based waveform tracing tool

Yu-Ching Ho, Mark Haoxing Ren, and Brucek Khailany. VerilogCoder: Autonomous Verilog coding agents with graph-based planning and abstract syntax tree (AST)-based waveform tracing tool. InProceedings of the AAAI Conference on Artificial Intelligence, 2025

2025

[9] [9]

MetaGPT: Meta program- ming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta program- ming for a multi-agent collaborative framework. InInternational Conference on Learning Representations (ICLR), 2024

2024

[10] [10]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models, 2021. 10

2021

[11] [11]

AIvril: AI-driven RTL generation with verification in-the-loop, 2024

Mubashir ul Islam, Humza Sami, and Pierre-Emmanuel Gaillardon. AIvril: AI-driven RTL generation with verification in-the-loop, 2024

2024

[12] [12]

CraftRTL: High-quality synthetic data generation for Verilog code models with correct-by-construction non-textual representations and targeted code repair, 2024

Mingjie Liu, Yun-Da Tsai, et al. CraftRTL: High-quality synthetic data generation for Verilog code models with correct-by-construction non-textual representations and targeted code repair, 2024

2024

[13] [13]

Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Hongce Zhang, and Zhiyao Zhang. RTL- Coder: Fully open-source and efficient LLM-assisted RTL code generation technique.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 44(4):1448–1461, 2025

2025

[14] [14]

OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation

Shang Liu, Yao Lu, Wenji Fang, Mengming Li, and Zhiyao Xie. OpenLLM-RTL: Open dataset and benchmark for LLM-aided design RTL generation. In2024 IEEE/ACM International Conference on Computer Aided Design (ICCAD). IEEE, 2024. Includes RTLLM v2.0 with 50 hand-crafted designs

2024

[15] [15]

Multi-agent actor-critic for mixed cooperative-competitive environments

Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

2017

[16] [16]

Self-refine: Iterative refinement with self-feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[17] [17]

CoopetitiveV: Leveraging LLM-powered coopetitive multi-agent prompting for high-quality Verilog generation, 2024

Zhendong Mi, Renming Zheng, and Haowen Zhong. CoopetitiveV: Leveraging LLM-powered coopetitive multi-agent prompting for high-quality Verilog generation, 2024

2024

[18] [18]

BetterV: Controlled Verilog generation with discriminative guidance

Zehua Pei, Hui-Ling Zhen, Mingxuan Li, Jianye Hao, and Mingzhi Yuan. BetterV: Controlled Verilog generation with discriminative guidance. InProceedings of the International Conference on Machine Learning (ICML), 2024

2024

[19] [19]

Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation.arXiv preprint arXiv:2408.11053, 2024

Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, and Brucek Khailany. Revisiting VerilogEval: A year of improvements in large-language models for hardware code generation.arXiv preprint arXiv:2408.11053, 2024

work page arXiv 2024

[20] [20]

Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive Verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on RTL design and verifica- tion.arXiv preprint arXiv:2506.14074, 2025

work page arXiv 2025

[21] [21]

ChatDev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024

2024

[22] [22]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y .K. Li, Y . Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024

2024

[23] [23]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

Re- flexion: Language agents with verbal reinforcement learning

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Re- flexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[25] [25]

Pyverilog: A python-based hardware design processing toolkit for Verilog HDL

Shinya Takamaeda-Yamazaki. Pyverilog: A python-based hardware design processing toolkit for Verilog HDL. InInternational Symposium on Applied Reconfigurable Computing (ARC), volume 9040 ofLecture Notes in Computer Science, pages 451–460. Springer, 2015

2015

[26] [26]

Qwen3.5: More intelligence, less compute, 2026

Qwen Team. Qwen3.5: More intelligence, less compute, 2026. https://huggingface.co/ Qwen/Qwen3.5-9B. 11

2026

[27] [27]

AutoChip: Automating HDL generation using LLM feedback, 2023

Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, and Siddharth Garg. AutoChip: Automating HDL generation using LLM feedback, 2023

2023

[28] [28]

VeriReason: Reinforcement learning with testbench feedback for reasoning- enhanced Verilog generation, 2025

Yiting Wang et al. VeriReason: Reinforcement learning with testbench feedback for reasoning- enhanced Verilog generation, 2025

2025

[29] [29]

Chain-of-thought prompting elicits reasoning in large language models, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022

2022

[30] [30]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[31] [31]

Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Ad- vances in neural information processing systems, 36:11809–11822, 2023

2023

[32] [32]

ChipBench: A next-step benchmark for evaluating LLM performance in AI-aided chip design.arXiv preprint arXiv:2601.21448, 2026

Zhongkai Yu, Chenyang Zhou, Yichen Lin, Hejia Zhang, Haotian Ye, Junxia Cui, Zaifeng Pan, Jishen Zhao, and Yufei Ding. ChipBench: A next-step benchmark for evaluating LLM performance in AI-aided chip design.arXiv preprint arXiv:2601.21448, 2026

work page arXiv 2026

[33] [33]

RTLSeek: Boosting the LLM-based RTL generation with multi-stage diversity-oriented reinforcement learning, 2026

Xinyu Zhang et al. RTLSeek: Boosting the LLM-based RTL generation with multi-stage diversity-oriented reinforcement learning, 2026

2026

[34] [34]

Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, and Jishen Zhao. Stronger-MAS: Multi-agent reinforcement learning for collaborative LLMs, 2025

2025

[35] [35]

MAGE: A multi-agent engine for automated RTL code generation, 2024

Yujie Zhao, Hejia Zhang, Hanxian Huang, Zhongming Yu, and Jishen Zhao. MAGE: A multi-agent engine for automated RTL code generation, 2024

2024

[36] [36]

LlamaFactory: Unified efficient fine-tuning of 100+ language models

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. LlamaFactory: Unified efficient fine-tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 400–410, Bangkok, Thailand, 2024. Association for Computational Linguistics

2024

[37] [37]

population count

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, and Yunji Chen. QiMeng-CodeV-R1: Reasoning-enhanced Verilog generation.arXiv preprint arXiv:2505.24183, 2025. A Agent Prompt Templates We li...

work page arXiv 2025

[38] [38]

Stage 1: Compute a + b and c - d, and store the value of d. 13

[39] [39]

Stage 2: Compute the sum of the results from Stage 1

[40] [40]

a", 0) & 0x3FF b = inputs.get(

Stage 3: Compute the product of the result from Stage 2 and the stored value of d. The ALU should use pipelining to ensure that each stage operates independently, and the final result should be available at the output F. This Python class, named alu_op, has the interface designed in a port table (signal name, direction, width, description) for each of a, ...

2000