RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation

Jiahao Yang; Jianmin Ye; Xi Wang; Yifan Zhang

arxiv: 2604.24218 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-08 03:11 UTC · model grok-4.3

keywords: multi-agent framework · reference model generation · SystemC · hardware modeling · co-evolutionary verification · LLM agents · SoC design · context optimization

The pith

RefEvo deploys a dynamic multi-agent framework with co-evolutionary verification to produce reliable SystemC reference models for hardware designs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RefEvo to address three core problems when large language models generate hardware reference models: inflexible static workflows that ignore design complexity, loss of specification details across long interactions, and the risk that flawed models pass because their testbenches contain matching errors. It proposes an adaptive planner that builds custom workflows for each specification, a verification process in which a dialectical arbiter revises both the model and its testbench against the original specification at once, and a compression technique that keeps every requirement visible without exhausting context windows. On a benchmark of twenty hardware modules the system reaches a 95 percent pass rate while cutting token consumption by 71 percent on average and preserving complete specification recall. These outcomes would let hardware teams create high-fidelity early models faster and with fewer wasted LLM calls.

Core claim

RefEvo is a multi-agent framework that uses a Dynamic Design Planner to decompose specifications and build tailored execution flows, a Co-Evolutionary Verification Mechanism in which a Dialectical Arbiter jointly corrects the generated SystemC model and its testbench against the specification oracle to eliminate false positives, and a Spec Anchoring Strategy that compresses context without losing requirements, achieving a 95 percent pass rate and 71.04 percent average token reduction on twenty diverse hardware modules.
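The idea of keeping the specification pinned while the rest of the conversation is trimmed can be sketched as below. This is a minimal illustration, not the paper's Spec Anchoring Strategy: the abstract does not describe the compression method, so a simple pin-and-truncate policy is swapped in. The names `count_tokens` and `build_context` and the whitespace-based token count are assumptions made for the example.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude token proxy (assumption): count whitespace-separated words."""
    return len(text.split())


def build_context(spec: str, history: List[str], budget: int) -> List[str]:
    """Pin the spec first, then keep the most recent turns that still fit.

    The spec is never trimmed, so every requirement stays visible; only
    older conversation turns are dropped when the budget runs out.
    """
    remaining = budget - count_tokens(spec)
    if remaining < 0:
        raise ValueError("budget too small to hold the pinned spec")

    kept: List[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break  # stop at the first turn that no longer fits
        kept.append(turn)
        remaining -= cost

    return [spec] + kept[::-1]  # restore chronological order


spec = "adder wraps at 8 bits"
history = ["old compile log " * 10, "fix wrap bug", "rerun tests now"]
context = build_context(spec, history, budget=12)
```

With a budget of 12, the long compile log is dropped while the spec and the two most recent turns survive. A real system would need a model-specific tokenizer and a smarter policy than plain truncation.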

What carries the argument

The Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the generated model and its testbench against the specification oracle.
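A toy example may help show why judging the model and the testbench against each other is not enough. In this sketch the specification is an executable function, which is a strong simplification: in the paper the oracle is the written specification and the arbiter is an LLM agent. All names here (`spec_oracle`, `flawed_model`, `naive_pass`, `arbitrate`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Inputs = Tuple[int, int]
Vector = Tuple[Inputs, int]  # ((a, b), expected_sum)


def spec_oracle(a: int, b: int) -> int:
    """Toy spec (assumption): an 8-bit adder that wraps on overflow."""
    return (a + b) & 0xFF


def flawed_model(a: int, b: int) -> int:
    """Hypothetical generated model that forgot the 8-bit wrap."""
    return a + b


def naive_pass(model: Callable[[int, int], int], testbench: List[Vector]) -> bool:
    """Static check: the model only has to agree with its own testbench."""
    return all(model(*inputs) == expected for inputs, expected in testbench)


def arbitrate(
    model: Callable[[int, int], int],
    testbench: List[Vector],
    oracle: Callable[[int, int], int],
) -> Dict[str, List[Inputs]]:
    """Judge both artifacts against the spec, not against each other."""
    testbench_faults = [
        inputs for inputs, expected in testbench if oracle(*inputs) != expected
    ]
    model_faults = [
        inputs for inputs, _ in testbench if model(*inputs) != oracle(*inputs)
    ]
    return {"testbench_faults": testbench_faults, "model_faults": model_faults}


# The testbench shares the model's mistake: it expects 300 instead of 44.
flawed_tb: List[Vector] = [((1, 2), 3), ((200, 100), 300)]

coupled_pass = naive_pass(flawed_model, flawed_tb)          # True: a false positive
verdict = arbitrate(flawed_model, flawed_tb, spec_oracle)   # flags both artifacts
```

Here the naive check passes even though both artifacts are wrong, while the spec-based check flags the same input on both sides. Whether an LLM arbiter achieves the same separation in practice is exactly the open question raised in the analysis further down.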

If this is right

Hardware teams can generate reference models that adapt automatically to varying design complexity without manual workflow tuning.
The joint correction of model and testbench reduces the chance that correlated hallucinations produce falsely passing verification.
Token costs for iterative reference-model sessions drop by roughly 71 percent on average while every specification detail remains available.
Early SoC architecture exploration gains from higher-fidelity models produced earlier in the design cycle.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint-evolution pattern could be tested in domains outside hardware where both implementation and validation code must stay consistent, such as software test generation.
Extending the benchmark to complete system-on-chip designs with hundreds of interacting modules would show whether the reported efficiency and accuracy scale.
If the arbiter reliably breaks correlated error cycles, similar mechanisms might improve reliability in general-purpose LLM coding agents that must produce both code and tests.

Load-bearing premise

That the dialectical arbiter can fix errors in both the model and testbench at the same time without creating new matching mistakes, and that results from twenty modules represent the demands of full-scale SoC projects.

What would settle it

Applying the system to a new collection of modules that contain known subtle specification conflicts and finding that pass rates fall below 80 percent or that token savings coincide with dropped requirements.

Figures

Figures reproduced from arXiv: 2604.24218 by Jiahao Yang, Jianmin Ye, Xi Wang, Yifan Zhang.

**Figure 1.** Performance capability of state-of-the-art LLMs using optimized […]

**Figure 2.** The logical architecture of RefEvo. (A) The Dynamic Planning phase where Agent 1 analyzes complexity to construct an execution plan. (B) The […]

**Figure 3.** End-to-End Success Rate across different models and modes. RefEvo consistently outperforms baselines.

**Figure 4.** Comparison of context management strategies. Spec Anchoring pins […]

**Figure 5.** Failure distribution breakdown. The transition from Flow to FixedTB eliminates compilation errors, while the transition to RefEvo resolves functional […]

**Figure 6.** Methodological Robustness Analysis. The consistent upward trend across all models confirms that RefEvo effectively enhances generation reliability […]

**Figure 7.** Token consumption comparison across Simple, Medium, and Complex […]

Original abstract

As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architecture exploration and verification. While Large Language Models (LLMs) show promise in code generation, their application to hardware modeling faces unique challenges: (1) Rigid, static workflows fail to adapt to varying design complexity, causing inefficiency; (2) Context window overflow in multi-turn interactions leads to catastrophic forgetting of critical specifications; and (3) the Coupled Validation Failure problem--where generated Testbenches (TBs) incorrectly validate flawed models due to correlated hallucinations--severely undermines reliability. To address these limitations, we introduce RefEvo, a dynamic multi-agent framework designed for agile and reliable reference modeling. RefEvo features three key innovations: (1) A Dynamic Design Planner that autonomously decomposes design specifications and constructs tailored execution workflows based on semantic complexity; (2) A Co-Evolutionary Verification Mechanism, which employs a Dialectical Arbiter to simultaneously rectify the model and verification logic against the specification (Spec) oracle, effectively mitigating false positives; and (3) A Spec Anchoring Strategy for lossless context compression. Evaluated on a diverse benchmark of 20 hardware modules, RefEvo achieves a 95% pass rate, outperforming static baselines by a large margin. Furthermore, our context optimization reduces token consumption by an average of 71.04%, achieving absolute savings of over 70,000 tokens per session for complex designs while maintaining 100% specification recall.
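As a rough consistency check on the abstract's headline numbers, the arithmetic can be written out. Two assumptions are made that the abstract does not state: that each of the 20 modules yields a single pass/fail outcome, and that the 71.04% average reduction also applies to a complex-design session that saves 70,000 tokens. Under those assumptions:

```latex
\begin{align*}
\text{passing modules} &= 0.95 \times 20 = 19,\\
T_{\text{baseline}} &\approx \frac{70{,}000}{0.7104} \approx 98{,}536 \ \text{tokens},\\
T_{\text{compressed}} &\approx 98{,}536 - 70{,}000 = 28{,}536 \ \text{tokens}.
\end{align*}
```

If the reduction ratio for complex designs differs from the average, the implied baseline changes accordingly, so these figures are only indicative.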

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RefEvo packages a dynamic planner, spec anchoring, and a dialectical arbiter into an agentic loop for SystemC reference models, with reported 95% pass rate and 71% token cuts, but the co-evolution fix lacks the ablations needed to confirm it breaks error correlation.

The paper's core contribution is a three-part agent framework that tries to make LLM-based hardware modeling more reliable than static prompting. The Dynamic Design Planner breaks specs into complexity-aware workflows, the Spec Anchoring Strategy compresses context without losing requirements, and the Dialectical Arbiter runs a co-evolutionary loop that edits both the model and its testbench against the original spec oracle. On a 20-module benchmark the system hits 95% pass rate and cuts tokens by 71% on average while keeping full recall.

Those numbers, if they replicate, would be useful for teams doing early SoC exploration where manual reference-model work is a bottleneck. The approach directly targets the coupled-validation problem that shows up when an LLM hallucinates matching flaws in model and testbench, which is a practical pain point in this domain. The authors also ship concrete component names and a workflow diagram, so the method is at least reproducible enough to try on new modules.

The main weakness is that the central reliability claim rests on the arbiter's ability to correct correlated errors without creating new ones. The abstract describes simultaneous rectification but gives no ablation that isolates the arbiter step, no independent oracle check on the final pair, and no breakdown of the 5% failures. The benchmark is labeled diverse, yet we see no module-size metrics, no comparison to industrial-scale blocks, and no analysis of whether the margin over static baselines shrinks on harder cases. Without those, the 95% figure could reflect easier modules where correlation is rare rather than a general fix.

This work is aimed at hardware-verification engineers and LLM-tool builders who already experiment with multi-agent code generation. It is worth sending to peer review because the engineering problem is real, the proposed components are distinct, and the metrics are specific enough that referees can ask for the missing controls and still get a usable paper after revision.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces RefEvo, a dynamic multi-agent framework for generating SystemC reference models for SoC designs. It proposes a Dynamic Design Planner for adaptive workflow construction, a Co-Evolutionary Verification Mechanism featuring a Dialectical Arbiter to simultaneously correct generated models and testbenches against a specification oracle, and a Spec Anchoring Strategy for context compression. On a benchmark of 20 hardware modules, it reports a 95% pass rate outperforming static baselines, along with 71.04% average reduction in token consumption while preserving 100% specification recall.

Significance. If the empirical results hold under rigorous validation, RefEvo could advance automated hardware design by providing a more reliable and efficient alternative to static LLM workflows for reference modeling. The co-evolutionary approach addresses a critical issue in LLM-generated verification artifacts. The work is primarily empirical.

major comments (2)

[Co-Evolutionary Verification Mechanism] The Co-Evolutionary Verification Mechanism (described in the method section) claims the Dialectical Arbiter simultaneously rectifies both the model and testbench against the Spec oracle to mitigate Coupled Validation Failure, but provides no formal argument, ablation study, or independent oracle check demonstrating that the correction step does not introduce new correlated hallucinations via shared LLM context.
[Evaluation] The evaluation reports a 95% pass rate on 20 hardware modules with a large margin over static baselines and 71.04% token savings, but supplies no information on baseline definitions, statistical tests, module selection criteria, complexity metrics, or failure-mode analysis, leaving the central reliability claim unevaluable.
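To make the concern about missing statistical tests concrete, the uncertainty on a pass rate measured over 20 modules can be estimated with a Wilson score interval. This is a sketch under the assumption that the 95% figure corresponds to 19 passes out of 20 independent pass/fail outcomes, which the abstract does not confirm. The function name `wilson_interval` is chosen for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


# Assumed reading of the reported result: 19 of 20 modules pass.
low, high = wilson_interval(19, 20)
print(f"95% interval for 19/20: [{low:.3f}, {high:.3f}]")
```

Under that assumption the interval runs from roughly 0.76 to 0.99, which shows how wide the plausible range remains with only 20 modules and why the referee asks for significance tests against the baselines.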

minor comments (1)

[Abstract] The abstract introduces 'Coupled Validation Failure' and 'Dialectical Arbiter' without a concise definition or pointer to the detailed mechanism in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and describe the revisions that will be incorporated.

Referee: [Co-Evolutionary Verification Mechanism] The Co-Evolutionary Verification Mechanism (described in the method section) claims the Dialectical Arbiter simultaneously rectifies both the model and testbench against the Spec oracle to mitigate Coupled Validation Failure, but provides no formal argument, ablation study, or independent oracle check demonstrating that the correction step does not introduce new correlated hallucinations via shared LLM context.

Authors: The Dialectical Arbiter uses the specification oracle as an independent ground truth to guide simultaneous corrections to the model and testbench, with explicit context separation in the prompting to reduce shared hallucination risk. While the current manuscript provides no formal mathematical argument (the system is heuristic and LLM-driven) or dedicated ablation, we will add both an ablation study (full co-evolution vs. arbiter-ablated and shared-context variants) and quantitative analysis of correlated error rates in the revised version to empirically demonstrate the mechanism's effect.
revision: yes

Referee: [Evaluation] The evaluation reports a 95% pass rate on 20 hardware modules with a large margin over static baselines and 71.04% token savings, but supplies no information on baseline definitions, statistical tests, module selection criteria, complexity metrics, or failure-mode analysis, leaving the central reliability claim unevaluable.

Authors: We agree that the evaluation section lacks sufficient detail for independent assessment. In the revised manuscript we will expand it to define the static baselines explicitly, report statistical significance tests on the pass-rate differences, specify module selection criteria and complexity metrics (e.g., spec length, port count), and include a failure-mode analysis of the non-passing cases. The benchmark modules and prompts will also be released.
revision: yes
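The correlated-error analysis mentioned in the first response could be tabulated along the following lines. This is a hypothetical sketch: the record fields, the variant labels, and the sample records are placeholders made up for illustration, not data or definitions from the paper. A run counts as a coupled failure when the generated testbench accepts the generated model but an independent check against the specification rejects it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Run:
    variant: str         # hypothetical label, e.g. "full" or "no_arbiter"
    tb_passed: bool      # generated testbench accepted the generated model
    oracle_passed: bool  # independent check against the spec agreed


def coupled_failure_rate(runs: Iterable[Run], variant: str) -> float:
    """Share of runs for one variant where the testbench passed wrongly."""
    selected = [r for r in runs if r.variant == variant]
    if not selected:
        raise ValueError(f"no runs recorded for variant {variant!r}")
    coupled = sum(1 for r in selected if r.tb_passed and not r.oracle_passed)
    return coupled / len(selected)


# Placeholder records for illustration only; not results from the paper.
example_runs = [
    Run("full", tb_passed=True, oracle_passed=True),
    Run("full", tb_passed=True, oracle_passed=False),
    Run("no_arbiter", tb_passed=True, oracle_passed=False),
    Run("no_arbiter", tb_passed=True, oracle_passed=False),
]

full_rate = coupled_failure_rate(example_runs, "full")
ablated_rate = coupled_failure_rate(example_runs, "no_arbiter")
```

Comparing this rate between the full system and an arbiter-ablated variant would show whether the arbiter actually reduces false positives, provided the independent check does not share context with the agents under test.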

Circularity Check

0 steps flagged

No circularity: empirical evaluation of agentic framework

The paper describes an empirical multi-agent system (RefEvo) with three named components and reports benchmark results (95% pass rate on 20 modules, 71.04% token savings) without any mathematical derivation chain, equations, fitted parameters, or first-principles predictions. Performance claims rest on external evaluation against a specification oracle rather than reducing to the method's own inputs by construction. No self-citations appear as load-bearing premises, and the central claims are falsifiable via the stated benchmark rather than self-referential.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review limits visibility into assumptions; the framework presupposes that LLMs can be orchestrated into reliable hardware modeling and that the introduced agent roles function as claimed.

axioms (1)

domain assumption: LLMs are capable of generating hardware models when properly orchestrated
The entire approach rests on this premise stated in the opening of the abstract.

invented entities (2)

Dialectical Arbiter (no independent evidence)
purpose: Simultaneously rectifies the model and verification logic against the Spec oracle to mitigate Coupled Validation Failure
New component introduced as the core of the Co-Evolutionary Verification Mechanism; no independent evidence supplied.

Dynamic Design Planner (no independent evidence)
purpose: Autonomously decomposes specifications and builds tailored workflows based on semantic complexity
New component for adapting to varying design difficulty; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5593 in / 1483 out tokens · 69442 ms · 2026-05-08T03:11:55.102477+00:00 · methodology

[1] [1]

Design and verification of systemc transaction- level models,

A. Habibi and S. Tahar, “Design and verification of systemc transaction- level models,”IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 14, no. 1, pp. 57–68, 2006

work page 2006

[2] [2]

Hardware design and verification with large language models: A scoping review, challenges, and open issues,

M. Abdollahi, S. F. Yeganli, M. Baharloo, and A. Baniasadi, “Hardware design and verification with large language models: A scoping review, challenges, and open issues,”Electronics (2079-9292), vol. 14, no. 1, 2025

work page 2079

[3] [3]

Chatmodel: Automating reference model design and verification with llms,

J. Ye, T. Liu, Q. Tian, S. Su, Z. Jiang, and X. Wang, “Chatmodel: Automating reference model design and verification with llms,” 2025. [Online]. Available: https://arxiv.org/abs/2506.15066

work page arXiv 2025

[4] [4]

Different reference models for uvm environment to speed up the verification time,

A. Moursi, R. Samhoud, Y . Kamal, M. Magdy, S. El-Ashry, and A. Shalaby, “Different reference models for uvm environment to speed up the verification time,” in2018 19th International Workshop on Microprocessor and SOC Test and Verification (MTV), 2018, pp. 67– 72

work page 2018

[5] [5]

Trends in functional verification: A 2014 industry study,

H. D. Foster, “Trends in functional verification: A 2014 industry study,” IEEE, 2015

work page 2014

[6] [6]

Case study: Re-visiting soc verification challenges and best practices,

P. Ghosh, S. Ghosh, P. Singh, and S. Mishra, “Case study: Re-visiting soc verification challenges and best practices,”IEEE, 2015

work page 2015

[7] [7]

Liu, T.-D

M. Liu, T.-D. Ene, R. Kirby, C. Cheng, N. Pinckney, R. Liang, J. Alben, H. Anand, S. Banerjee, I. Bayraktarogluet al., “Chipnemo: Domain- adapted llms for chip design,”arXiv preprint arXiv:2311.00176, 2023

work page arXiv 2023

[8] [8]

LLM-assisted circuit verification: A comprehensive survey,

H. Liu, Y . Lu, M. Wang, X. Yao, and B. Yu, “LLM-assisted circuit verification: A comprehensive survey,” inProceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), Hong Kong, 2026

work page 2026

[9] [9]

Rtlcoder: Fully open- source and efficient llm-assisted rtl code generation technique,

S. Liu, W. Xiao, Y . Li, D. Z. Pan, and Z. Hu, “Rtlcoder: Fully open- source and efficient llm-assisted rtl code generation technique,”IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024, arXiv:2312.08617

work page arXiv 2024

[10] [10]

Chatchisel: Enabling agile hardware de- sign with large language models,

T. Liu, Q. Tian, J. Ye, L. Fu, S. Su, J. Li, G.-W. Wan, L. Zhang, S.-Z. Wong, X. Wang, and J. Yang, “Chatchisel: Enabling agile hardware de- sign with large language models,” in2024 2nd International Symposium of Electronics Design Automation (ISEDA), 2024, pp. 710–716

work page 2024

[11] [11]

Rechisel: Effective automatic chisel code generation by llm with reflection,

J. Niu, X. Liu, D. Niu, X. Wang, Z. Jiang, and N. Guan, “Rechisel: Effective automatic chisel code generation by llm with reflection,” in 2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025, pp. 1–7

work page 2025

[12] [12]

Chatcpu: An agile cpu design and verification platform with llm,

X. Wang, G.-W. Wan, S.-Z. Wong, L. Zhang, T. Liu, Q. Tian, and J. Ye, “Chatcpu: An agile cpu design and verification platform with llm,” in Proceedings of the 61st ACM/IEEE Design Automation Conference, ser. DAC ’24. New York, NY , USA: Association for Computing Machinery,

work page

[13] [13]

Available: https://doi.org/10.1145/3649329.3658493

[Online]. Available: https://doi.org/10.1145/3649329.3658493

work page doi:10.1145/3649329.3658493

[14] [14]

Benchmarking large language models for auto- mated verilog rtl code generation,

S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan- Gavitt, and S. Garg, “Benchmarking large language models for auto- mated verilog rtl code generation,”arXiv preprint arXiv:2212.11140, 2022

work page arXiv 2022

[15] [15]

Towards formal verification of real-world systemc tlm peripheral models - a case study,

H. M. Le, V . Herdt, D. Große, and R. Drechsler, “Towards formal verification of real-world systemc tlm peripheral models - a case study,” IEEE, 2016

work page 2016

[16] [16]

Large language model-aware in-context learning for code generation,

J. Li, G. Li, C. Tao, J. Li, H. Zhang, F. Liu, and Z. Jin, “Large language model-aware in-context learning for code generation,” 2023

work page 2023

[17] [17]

The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning,

S. Kim, S. Joo, D. Kim, J. Jang, S. Ye, J. Shin, and M. Seo, “The cot collection: Improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning,” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023

[18] [18]

A comprehensive investigation of universal verification methodology (uvm) standard for design verification,

S. Qamar, W. H. Butt, M. W. Anwar, F. Azam, and M. Khan, “A comprehensive investigation of universal verification methodology (uvm) standard for design verification,” Proceedings of the 2020 9th International Conference on Software and Computer Applications, pp. 339–343, 2020

work page 2020

[19] [19]

Chattest: Coverage-enhanced testbench generation for agile hardware verification with llms,

G.-W. Wan, S. Su, J. Zhang, S. Z. Wong, M. Xing, L. Ji, Z. Jiang, X. Wang, and J. Yang, “Chattest: Coverage-enhanced testbench generation for agile hardware verification with llms,” in Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE). Verona, Italy: IEEE/ACM, Apr. 2026, pp. 1–7, hal-05482572

work page 2026

[20] [20]

FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification,

G.-W. Wan, S. Su, R. Wang, Q. Chen, S.-Z. Wong, M. Xing, H. Feng, Y. Wang, Y. Zhu, J. Zhang, J. Ye, X. Wan, T. Ni, Q. Xu, N. Guan, Z. Jiang, X. Wang, and J. Yang, “Fixme: Towards end-to-end benchmarking of LLM-aided design verification,” in Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, ser. AAAI ’26, 2026, to appear. [Online]. ...

work page arXiv 2026

[21] [21]

Uvllm: An automated universal rtl verification framework using llms,

Y. Hu, J. Ye, K. Xu, J. Sun, S. Zhang, X. Jiao, D. Pan, J. Zhou, N. Wang, and W. Shan, “Uvllm: An automated universal rtl verification framework using llms,” 2024

work page 2024

[22] [22]

Effective processor verification with logic fuzzer enhanced co-simulation,

N. Kabylkas, T. Thorn, S. Srinath, P. Xekalakis, and J. Renau, “Effective processor verification with logic fuzzer enhanced co-simulation,” MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 667–678, 2021

work page 2021

[23] [23]

Self-planning code generation with large language models,

X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” 2023

work page 2023

[24] [24]

ChipMind: Retrieval-Augmented Reasoning for Long-Context Circuit Design Specifications,

C. Xing, S. Wong, X. Wan, Y. Lu, M. Zhang, Z. Ma, L. Qi, Z. Li, N. Guan, Z. Jiang, X. Wang, and J. Yang, “ChipMind: Retrieval-Augmented Reasoning for Long-Context Circuit Design Specifications,” in Proceedings of the Fortieth AAAI Conference on Artificial Intelligence, 2026, to appear. [Online]. Available: https://arxiv.org/abs/2512.05371

work page arXiv 2026

[25] [25]

Chatsva: Bridging sva generation for hardware verification via task-specific llms,

L. T. Fu, J. Zhou, S. Ren, M. Zhang, J. Xiong, H. Jiang, N. Guan, X. Wang, and J. Yang, “Chatsva: Bridging sva generation for hardware verification via task-specific llms,” 2026. [Online]. Available: https://api.semanticscholar.org/CorpusID:287121296

work page 2026

[26] [26]

Agentmesh: A cooperative multi-agent generative ai framework for software development automation,

S. Khanzadeh, “Agentmesh: A cooperative multi-agent generative ai framework for software development automation,” 2025

work page 2025

[27] [27]

Divergent thoughts toward one goal: LLM-based multi-agent collaboration system for electronic design automation,

H. Wu, H. Zheng, Z. He, and B. Yu, “Divergent thoughts toward one goal: LLM-based multi-agent collaboration system for electronic design automation,” in Proceedings of the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), Albuquerque, New Mexico, 2025

work page 2025

[28] [28]

Guided code generation with llms: A multi-agent framework for complex code tasks,

A. Almorsi, M. Ahmed, and W. Gomaa, “Guided code generation with llms: A multi-agent framework for complex code tasks,” 2025

work page 2025

[29] [29]

idse: Navigating design space exploration in high-level synthesis using llms,

R. Li, J. Xiong, and X. Wang, “idse: Navigating design space exploration in high-level synthesis using llms,” ArXiv, vol. abs/2505.22086. [Online]. Available: https://api.semanticscholar.org/CorpusID:278959926

work page arXiv

[31] [31]

A comprehensive survey of ai-driven advancements and techniques in automated program repair and code generation,

A. Anand, A. Gupta, N. Yadav, and S. Bajaj, “A comprehensive survey of ai-driven advancements and techniques in automated program repair and code generation,” 2024

work page 2024

[32] [32]

Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,

C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair via conversation: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2024. [Online]. Available: https://arxiv.org/abs/2304.00385

work page arXiv 2024

[33] [33]

Towards practical and useful automated program repair for debugging,

Q. Xin, H. Wu, S. P. Reiss, and J. Xuan, “Towards practical and useful automated program repair for debugging,” 2024

work page 2024

[34] [34]

Tdag: A multi-agent framework based on dynamic task decomposition and agent generation,

Y. Wang, Z. Wu, J. Yao, and J. Su, “Tdag: A multi-agent framework based on dynamic task decomposition and agent generation,” Neural Networks, vol. 185, no. 000, 2025

work page 2025

[35] [35]

Advancing agentic systems: Dynamic task decomposition, tool integration and evaluation using novel metrics and dataset,

A. G. Gabriel, A. A. Ahmad, and S. K. Jeyakumar, “Advancing agentic systems: Dynamic task decomposition, tool integration and evaluation using novel metrics and dataset,” 2024

work page 2024

[36] [36]

OpenCores: Open source hardware IP core community,

OpenCores Community, “OpenCores: Open source hardware IP core community,” 2025, accessed: 2025. [Online]. Available: https://opencores.org/

work page 2025

[37] [37]

Genben: A generative benchmark for LLM-aided design,

G.-W. Wan, Y. Wang, S. Wong, J. Zhang, M. Xing, Z. Jiang, N. Guan, Y. Wang, N. Xu, Q. Xu, and X. Wang, “Genben: A generative benchmark for LLM-aided design,” 2025. [Online]. Available: https://openreview.net/forum?id=gtVo4xcpFI

work page 2025

[38] [38]

XuanTie open source RISC-V project,

T-Head Semiconductor, “XuanTie open source RISC-V project,” 2025, accessed: 2025. [Online]. Available: https://github.com/T-Head-Semi

work page 2025