Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Nathaniel Pinckney; Zijian Du

arxiv: 2605.21810 · v1 · pith:TVVUULYSnew · submitted 2026-05-20 · 💻 cs.AI · cs.MA

Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3

keywords: Trace2Skill, CVDP, hardware agents, skill evolution, verifier feedback, test-time scaling, Verilog design, EDA

The pith

Trace2Skill evolves an agent's natural-language skills from rollout traces using verifier feedback to solve complex Verilog design problems without model updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Trace2Skill, a framework that treats an LLM agent's skills as an evolvable policy for tackling Complex Verilog Design Problems. It analyzes repeated rollout traces to identify success and failure modes, converts these into dense diagnostics and oracle lessons, and employs an oracle-mutator-selector loop to generate task-specific skills. These skills then guide the agent's search, editing, validation, and recovery processes. This approach connects skill descriptions directly to verifier evidence and behavior, leading to improved performance on difficult tasks that stump both the initial agent and advanced coding models. Importantly, it achieves these gains through test-time scaling alone, without any fine-tuning data, specialized training, or changes to model weights, and suggests applicability to other verifiable EDA tasks.

Core claim

Trace2Skill is a test-time scaling framework that improves hardware LLM agents on Complex Verilog Design Problems by mining repeated rollout traces for success and failure modes, converting them into dense diagnostics and oracle lessons, and using an oracle, mutator, and selector loop to produce task-specific skills that guide later actions, supported by bounded runtime dense verifier feedback that provides sanitized functional observations.

What carries the argument

The oracle-mutator-selector loop that evolves task-specific skills from traces and dense verifier feedback, serving as the mechanism to connect skill text to evidence and behavior.
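
The loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `oracle`, `mutate`, `evolve_skill`, and `toy_rollout`, the two made-up lessons, and the selection rule (keep a candidate only if its observed pass rate is strictly higher) are all hypothetical.

```python
from typing import Callable, List, Tuple

# A rollout runs the agent with a skill text and returns (passed, diagnostic).
Rollout = Callable[[str, int], Tuple[bool, str]]

# Toy stand-in for a verifier-backed task; these lessons are made up.
NEEDED = ["check include paths", "rerun the testbench"]


def toy_rollout(skill: str, seed: int) -> Tuple[bool, str]:
    """Pass more often as the skill text covers more of the needed lessons."""
    have = [n for n in NEEDED if n in skill]
    missing = [n for n in NEEDED if n not in skill]
    if (seed % 4) < 2 * len(have):
        return True, "ok"
    return False, missing[seed % len(missing)]


def oracle(traces: List[Tuple[bool, str]]) -> List[str]:
    """Turn failed traces into de-duplicated lessons (hypothetical oracle)."""
    lessons: List[str] = []
    for passed, diagnostic in traces:
        if not passed and diagnostic not in lessons:
            lessons.append(diagnostic)
    return lessons


def mutate(skill: str, lessons: List[str]) -> List[str]:
    """Propose one candidate skill per new lesson (hypothetical mutator)."""
    return [f"{skill}\n- {lesson}" for lesson in lessons if lesson not in skill]


def pass_rate(skill: str, rollout: Rollout, n: int) -> float:
    """Fraction of n seeded rollouts that pass with this skill."""
    return sum(rollout(skill, seed)[0] for seed in range(n)) / n


def evolve_skill(seed_skill: str, rollout: Rollout, rounds: int = 3, n: int = 8) -> str:
    """Repeat: collect traces, extract lessons, mutate, select the best skill."""
    best, best_rate = seed_skill, pass_rate(seed_skill, rollout, n)
    for _ in range(rounds):
        traces = [rollout(best, seed) for seed in range(n)]
        for candidate in mutate(best, oracle(traces)):
            rate = pass_rate(candidate, rollout, n)
            if rate > best_rate:  # selector: keep strictly better candidates
                best, best_rate = candidate, rate
    return best


if __name__ == "__main__":
    print(evolve_skill("Read the task statement first.", toy_rollout))
```

In this toy setting the selector only advances because a partially correct skill already raises the pass rate. With a strictly binary outcome and no partial credit, every single-lesson candidate would tie with the seed skill, which is the situation the dense verifier feedback is meant to address.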

Load-bearing premise

Repeated rollout traces contain extractable success and failure modes that can be reliably converted into dense diagnostics and oracle lessons capable of guiding effective skill evolution.

What would settle it

Running the Trace2Skill process on the same set of hard CVDP tasks and observing no increase in pass rates or no new solutions on previously unsolved tasks compared to the base agent.
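
A check of this kind could be computed as below. The sketch assumes per-task lists of pass/fail outcomes for the base agent and for the evolved-skill agent; the task names and outcomes are placeholders, not results from the paper.

```python
from typing import Dict, List, Tuple


def pass_rates(outcomes: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-task fraction of runs that passed the verifier."""
    return {task: sum(runs) / len(runs) for task, runs in outcomes.items()}


def compare(
    base: Dict[str, List[bool]], evolved: Dict[str, List[bool]]
) -> Tuple[Dict[str, float], List[str]]:
    """Return per-task pass-rate change and the tasks solved only after evolution."""
    base_rates, evolved_rates = pass_rates(base), pass_rates(evolved)
    tasks = sorted(base_rates.keys() & evolved_rates.keys())
    delta = {t: evolved_rates[t] - base_rates[t] for t in tasks}
    newly_solved = [t for t in tasks if base_rates[t] == 0.0 and evolved_rates[t] > 0.0]
    return delta, newly_solved


# Placeholder outcomes for two hypothetical tasks, four runs each.
base_runs = {"task_a": [False, False, False, False], "task_b": [True, False, False, False]}
evolved_runs = {"task_a": [False, True, False, False], "task_b": [True, True, False, False]}

delta, newly_solved = compare(base_runs, evolved_runs)
# The claim would be undermined if no task improves and nothing is newly solved.
claim_undermined = all(d <= 0 for d in delta.values()) and not newly_solved
```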

Figures

Figures reproduced from arXiv: 2605.21810 by Nathaniel Pinckney, Zijian Du.

**Figure 1.** Trace2Skill end-to-end flow. Colors group component roles as shown in the legend; solid arrows mark the main ...

**Figure 2.** AgentQ proxy on 96 completed OSS seed-skill base ...

**Figure 3.** C3 versus C4 quality dynamics on the 8 hard CVDP ...

**Figure 4.** Task-level C3/C4 verifier outcomes on the 8 hard ...

**Figure 5.** Tree view of the repo-relative Trace2Skill implementation and representative run artifacts.

**Figure 6.** NeMo Gym rollout sequence for one submitted run.

Original abstract

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.
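
The idea of bounded, sanitized verifier feedback can be illustrated with a short Python sketch. The abstract does not describe the sanitizer, so everything here is an assumption: the redaction patterns, the line and character budgets, and the log format are hypothetical.

```python
import re
from typing import List

# Hypothetical patterns for content that must not reach the agent.
HIDDEN_PATTERNS = [
    re.compile(r"\S*/harness/\S*"),          # paths into a hidden harness directory
    re.compile(r"\S*reference_solution\S*"),  # any token naming a reference solution
]


def sanitize_feedback(raw_log: str, max_lines: int = 5, max_chars: int = 200) -> List[str]:
    """Return at most max_lines redacted, length-capped observations."""
    observations: List[str] = []
    for line in raw_log.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines so they do not use up the budget
        for pattern in HIDDEN_PATTERNS:
            line = pattern.sub("[redacted]", line)
        observations.append(line[:max_chars])
        if len(observations) == max_lines:
            break  # bounded: stop once the budget is spent
    return observations


raw = (
    "FAIL: out_valid mismatch at cycle 12\n"
    "see /work/harness/tb_top.sv line 40\n"
    "\n"
    "expected from reference_solution.v: 0x3f\n"
    "further detail that exceeds the budget\n"
)
feedback = sanitize_feedback(raw, max_lines=3)
```

Pattern-based redaction is only one possible design. A stricter sanitizer might build observations from an allow-list of fields instead of filtering raw logs.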

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Trace2Skill frames a test-time loop that mines agent traces into evolvable skills for hard Verilog tasks, but the abstract supplies no numbers or ablations to show the loop actually transfers.

The main thing here is a test-time procedure that turns repeated rollout traces into natural-language skills via an oracle-mutator-selector cycle, plus dense verifier feedback that gives sanitized observations without exposing hidden harnesses. That setup is presented as an alternative to fine-tuning or just sampling more, aimed at long-context EDA problems where agents must localize RTL, includes, and dependencies inside big repos.

The paper does a clean job laying out why coarse pass/fail labels are insufficient for these tasks and why connecting skill text to verifier evidence could help guide edits and recovery. The framing as a general strategy that might extend past digital design is also straightforward.

The soft spot is that the central performance claims rest on the assumption that traces contain extractable, reusable success and failure modes that the loop can turn into transferable guidance. In large Verilog repositories the traces are likely full of sparse signals around build paths and hidden failures, and without details on skill representation or mutation operators it is not clear the method avoids producing brittle, task-specific patches instead. The abstract states substantial gains and breakthrough passes on previously unsolved tasks, yet gives no quantitative results, error bars, dataset sizes, or ablation data, so the improvement cannot be checked from what is shown. The stress-test note about non-generalizable diagnostics looks like it lands.

This is for people working on agentic hardware design or test-time scaling for verifiable code tasks. A reader who wants to see whether the skill evolution actually works on real CVDP benchmarks would get value from the full methods and results. It deserves a serious referee to evaluate the implementation and the numbers.

Referee Report

3 major / 2 minor

Summary. The paper introduces Trace2Skill, a test-time scaling framework for LLM-based hardware agents tackling Complex Verilog Design Problems (CVDP). It mines repeated rollout traces to extract success and failure modes, converts them into dense diagnostics and oracle lessons via an oracle-mutator-selector loop, and augments this with bounded runtime dense verifier feedback to guide skill evolution, editing, validation, and recovery. The central claim is that this approach substantially raises task pass rates on hard CVDP instances that defeat both the seed agent and frontier coding models, including breakthrough solutions on previously unsolved tasks, all without fine-tuning data, RTL-specialized training, or model weight updates. The framework is positioned as a general strategy extensible to other verifiable EDA tasks.

Significance. If the empirical results and transferability claims hold under scrutiny, the work offers a meaningful contribution to test-time scaling for long-context agents in hardware design. It demonstrates a path to improve performance on repository-scale Verilog problems by evolving natural-language skills from traces rather than relying on model updates, which could reduce dependence on high-quality fine-tuning corpora in EDA domains. The use of dense verifier feedback to connect skill text with functional observations is a practical mechanism worth further exploration.

major comments (3)

[Abstract] The central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.
[Framework description, inferred §3] The skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.
[Evaluation section, inferred §4] The claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.

minor comments (2)

[Method] Clarify the exact format in which evolved skills are stored and injected into the agent's prompt or policy at inference time.
[Framework] Add a diagram or pseudocode for the oracle-mutator-selector loop to improve readability of the iterative process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional clarity and evidence would strengthen the manuscript. We address each major comment below and commit to revisions that improve transparency without altering the core claims or experimental setup.

Referee: [Abstract] The central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.

Authors: We agree that the abstract should convey key quantitative outcomes to enable immediate assessment of effect sizes. In the revised version we will add concise statements of the main pass-rate gains (including the number of tasks and comparison to seed and frontier baselines) while preserving the abstract's brevity; full statistics, error bars, ablations, and dataset details will remain in the evaluation section and supplementary material.
revision: yes

Referee: [Framework description, inferred §3] The skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.

Authors: Section 3 already specifies the oracle-mutator-selector loop, the natural-language skill format (structured diagnostic patterns plus recovery heuristics), and the selection criterion (empirical success on held-out validation rollouts). To make these elements fully explicit and to demonstrate abstraction beyond local patches, we will insert a detailed algorithm box and additional prose describing the mutation operators (e.g., generalization from concrete fixes to reusable diagnostic templates) and the abstraction mechanisms used to avoid repository-specific artifacts.
revision: yes

Referee: [Evaluation section, inferred §4] The claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.

Authors: The current evaluation already compares Trace2Skill against the seed agent and frontier models on the same hard CVDP tasks. We acknowledge that an explicit ablation separating the full oracle-mutator-selector loop from simple trace replay would further isolate the contribution of skill evolution. We will add this ablation in the revised manuscript and will also report cross-task transfer results to address concerns about diagnostic brittleness. These additions will be presented as new experiments rather than reinterpretation of existing data.
revision: partial
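
The skill format named in the second response (structured diagnostic patterns plus recovery heuristics) could be represented and rendered into prompt text roughly as follows. This is a hypothetical sketch: the `Skill` class, its field names, the example task id, and the rendered layout are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Skill:
    """Hypothetical container for one evolved, task-specific skill."""

    task_id: str
    # Each pattern pairs an observed symptom with a suspected cause.
    diagnostic_patterns: List[Tuple[str, str]] = field(default_factory=list)
    recovery_heuristics: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the skill as plain text that could be placed in an agent prompt."""
        lines = [f"Skill for {self.task_id}", "Diagnostic patterns:"]
        lines += [f"- if {symptom}, suspect {cause}" for symptom, cause in self.diagnostic_patterns]
        lines.append("Recovery heuristics:")
        lines += [f"- {heuristic}" for heuristic in self.recovery_heuristics]
        return "\n".join(lines)


example = Skill(
    task_id="example_task",  # placeholder id, not a real CVDP task
    diagnostic_patterns=[("the build cannot find a header", "a missing include path")],
    recovery_heuristics=["re-run the local testbench after every edit"],
)
prompt_text = example.render()
```

Keeping symptoms and causes as separate fields, instead of one free-text blob, would make it easier to check whether a mutation generalized a lesson or only copied a repository-specific fix.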

Circularity Check

0 steps flagged

No circularity: procedural framework with no equations or self-referential reductions

The paper presents Trace2Skill as a test-time scaling procedure that mines rollout traces, converts modes into diagnostics via an oracle-mutator-selector loop, and applies dense verifier feedback to evolve skills for CVDP tasks. No mathematical derivations, equations, or parameter-fitting steps appear in the abstract or description. Claims of improved pass rates rest on empirical application to hard tasks rather than any chain that reduces predictions to inputs by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The method is self-contained as an algorithmic recipe whose validity is intended to be assessed externally through task performance, not internal definitional closure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that sanitized verifier observations can be turned into actionable lessons without access to hidden harnesses or reference solutions.

free parameters (1)

skill mutation and selection parameters
Number of mutations, selection criteria, and feedback density thresholds are not quantified in the abstract but are required for the oracle-mutator-selector loop.

axioms (1)

domain assumption: Bounded runtime dense verifier feedback can be sanitized to provide functional observations while keeping hidden harnesses inaccessible.
Invoked when describing how feedback guides skill evolution without exposing reference solutions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for both success modes and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle-mutator-selector loop"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · relation: unclear

Paper passage: "SelectQ(S) = PassRate(S) + εU(S) with dense metrics SkillQ, AgentProgressQ"
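
The selection score quoted above can be made concrete with a small Python sketch. The quoted passage does not define U(S), so this example assumes it is the mean of the two dense metrics it names (SkillQ and AgentProgressQ) and that ε is small enough for the pass rate to dominate. Both choices are assumptions, as are the numeric values.

```python
def select_q(pass_rate: float, skill_q: float, agent_progress_q: float,
             epsilon: float = 0.01) -> float:
    """SelectQ(S) = PassRate(S) + epsilon * U(S).

    ASSUMPTION: U(S) is taken as the mean of SkillQ and AgentProgressQ;
    the quoted passage names these metrics but does not define U.
    """
    u = 0.5 * (skill_q + agent_progress_q)
    return pass_rate + epsilon * u


# Placeholder candidates: (pass_rate, skill_q, agent_progress_q), all in [0, 1].
candidates = {
    "skill_a": (0.0, 0.2, 0.4),    # never passes, weak dense signal
    "skill_b": (0.0, 0.6, 0.8),    # never passes, stronger dense signal
    "skill_c": (0.125, 0.0, 0.0),  # passes once in eight runs
}
scores = {name: select_q(*metrics) for name, metrics in candidates.items()}
best = max(scores, key=scores.get)
```

Under these assumptions the dense term only breaks ties between candidates with equal pass rates: `skill_b` outranks `skill_a`, while any candidate that actually passes outranks both.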

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.

[2] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023.

[3] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024.

[4] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024.

[5] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In Conference on Language Modeling, 2024.

[6] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024.

[7] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kai Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024.

[8] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.

[9] Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. Verilogeval: Evaluating large language models for verilog code generation. In IEEE/ACM International Conference on Computer-Aided Design, 2023.

[10] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. RTLLM: An open-source benchmark for design RTL generation with large language model. In Asia and South Pacific Design Automation Conference, 2024.

[11] Christopher Batten, Nathaniel Pinckney, Mingjie Liu, Haoxing Ren, and Brucek Khailany. PyHDL-Eval: An LLM evaluation framework for hardware design using python-embedded DSLs. In ACM/IEEE Symposium on Machine Learning for CAD, 2024.

[12] Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification, 2025.

[13] Chenhui Deng, Yunsheng Bai, and Haoxing Ren. ChipAlign: Instruction alignment in large language models for chip design via geodesic interpolation. In ACM/IEEE Design Automation Conference, 2025.

[14] Chenhui Deng, Yun-Da Tsai, Guan-Ting Liu, Zhongzhi Yu, and Haoxing Ren. ScaleRTL: Scaling LLMs with reasoning data and test-time compute for accurate RTL code generation. In ACM/IEEE Symposium on Machine Learning for CAD, 2025.

[15] Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. RTLFixer: Automatically fixing RTL syntax errors with large language models. In ACM/IEEE Design Automation Conference, 2024.

[16] Chia-Tung Ho, Haoxing Ren, and Brucek Khailany. VerilogCoder: Autonomous verilog coding agents with graph-based planning and abstract syntax tree-based waveform tracing tool. In AAAI Conference on Artificial Intelligence, 2025.

[17] Chenhui Deng, Zhongzhi Yu, Guan-Ting Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. ACE-RTL: When agentic context evolution meets RTL-specialized LLMs, 2026.

A Artifact Appendix. This appendix gives compact, trace-grounded artifacts used by the Trace2Skill pipeline. Full raw traces remain in the experiment direc...
