Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3
The pith
Trace2Skill evolves an agent's natural-language skills from rollout traces using verifier feedback to solve complex Verilog design problems without model updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Trace2Skill is a test-time scaling framework that improves hardware LLM agents on Complex Verilog Design Problems (CVDP). It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later actions. Bounded runtime dense verifier feedback supports the loop by returning sanitized functional observations.
What carries the argument
The oracle-mutator-selector loop, which evolves task-specific skills from traces and dense verifier feedback and thereby connects skill text to verifier evidence and downstream agent behavior.
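As a rough illustration of how such a loop could be wired together, the sketch below alternates selection, lesson extraction, and mutation over skill text. Everything in it is an assumption made for illustration: the `Trace` record, the `evolve_skill` function, the callback signatures, and the toy rollout are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    skill: str       # skill text the agent ran with
    passed: bool     # final verifier verdict for this rollout
    diagnostic: str  # dense diagnostic mined from the rollout


def evolve_skill(
    seed_skill: str,
    rollout: Callable[[str], Trace],
    oracle: Callable[[List[Trace]], List[str]],
    mutate: Callable[[str, str], str],
    rounds: int = 3,
    rollouts_per_skill: int = 4,
) -> Tuple[str, float]:
    """Hypothetical oracle-mutator-selector loop over natural-language skills."""
    candidates = [seed_skill]
    best: Optional[Tuple[float, str, List[Trace]]] = None
    for _ in range(rounds):
        # Selector: keep the candidate with the highest empirical pass rate.
        for skill in candidates:
            traces = [rollout(skill) for _ in range(rollouts_per_skill)]
            rate = sum(t.passed for t in traces) / len(traces)
            if best is None or rate > best[0]:
                best = (rate, skill, traces)
        # Oracle: turn the best skill's traces into lessons.
        lessons = oracle(best[2])
        # Mutator: fold each lesson into the best skill to form new candidates.
        candidates = [mutate(best[1], lesson) for lesson in lessons]
    return best[1], best[0]


# Toy stand-ins (assumptions): the "agent" passes only if the skill
# mentions include paths, and the oracle reports each failure diagnostic.
def toy_rollout(skill: str) -> Trace:
    ok = "include path" in skill
    return Trace(skill, ok, "" if ok else "missing include path")


def toy_oracle(traces: List[Trace]) -> List[str]:
    return sorted({t.diagnostic for t in traces if not t.passed})


def toy_mutate(skill: str, lesson: str) -> str:
    return skill + "\n- watch for: " + lesson


skill, rate = evolve_skill("- read the spec first", toy_rollout, toy_oracle, toy_mutate)
```

In this toy run the seed skill fails, the oracle surfaces the include-path diagnostic, and the mutated skill then passes. A real system would replace the three callbacks with LLM calls and verifier-backed rollouts.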
Load-bearing premise
Repeated rollout traces contain extractable success and failure modes that can be reliably converted into dense diagnostics and oracle lessons capable of guiding effective skill evolution.
What would settle it
Running the Trace2Skill process on the same set of hard CVDP tasks and observing no increase in pass rates or no new solutions on previously unsolved tasks compared to the base agent.
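A minimal sketch of that check, assuming per-task lists of pass/fail outcomes are available for both agents. The function name, the task identifiers, and the outcome data are invented placeholders, not results from the paper, and the sketch does no significance testing.

```python
from typing import Dict, List, Tuple


def compare_pass_rates(
    base: Dict[str, List[bool]],
    evolved: Dict[str, List[bool]],
) -> Tuple[Dict[str, Tuple[float, float]], List[str], List[str]]:
    """Compare per-task pass rates of the base agent and the evolved agent."""
    report = {}
    for task, base_runs in base.items():
        evolved_runs = evolved[task]
        report[task] = (
            sum(base_runs) / len(base_runs),
            sum(evolved_runs) / len(evolved_runs),
        )
    # Tasks whose pass rate went up under the evolved skill.
    improved = [t for t, (b, e) in report.items() if e > b]
    # Tasks the base agent never solved but the evolved agent did.
    newly_solved = [t for t, (b, e) in report.items() if b == 0 and e > 0]
    return report, improved, newly_solved


# Placeholder outcomes for two hypothetical tasks.
base_runs = {"task_a": [False, False], "task_b": [True, False]}
evolved_runs = {"task_a": [False, True], "task_b": [True, False]}
report, improved, newly_solved = compare_pass_rates(base_runs, evolved_runs)
```

Under this framing the claim would be in trouble if both `improved` and `newly_solved` came back empty on the hard CVDP task set.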
Original abstract
Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.
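One way to picture bounded, sanitized verifier feedback is a filter that keeps only a capped number of functional observations and masks anything pointing at the hidden harness. The sketch below is an assumption throughout: the `MISMATCH` marker, the path pattern, the log format, and the limits are hypothetical and do not describe the paper's actual verifier interface.

```python
import re
from typing import List

# Hypothetical convention: any token mentioning the hidden harness or the
# reference solution is masked before the agent sees it.
HIDDEN_TOKEN = re.compile(r"\S*(?:harness|reference)\S*")


def sanitize_feedback(raw_log: str, max_lines: int = 5, max_chars: int = 200) -> List[str]:
    """Return at most max_lines masked functional observations."""
    observations: List[str] = []
    for line in raw_log.splitlines():
        line = line.strip()
        # Keep functional observations only (assumed to carry this marker).
        if "MISMATCH" not in line:
            continue
        line = HIDDEN_TOKEN.sub("<hidden>", line)
        observations.append(line[:max_chars])  # bound the length of each line
        if len(observations) == max_lines:     # bound the number of lines
            break
    return observations


raw = (
    "INFO loading /opt/harness/tb_top.sv\n"
    "MISMATCH t=40 signal=out got=0x3 (/opt/harness/tb_top.sv:88)\n"
    "MISMATCH t=50 signal=valid got=0\n"
    "MISMATCH t=60 signal=ready got=1\n"
)
feedback = sanitize_feedback(raw, max_lines=2)
```

The caps matter as much as the masking: an unbounded stream of mismatches could let an agent reconstruct the hidden testbench by probing.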
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Trace2Skill, a test-time scaling framework for LLM-based hardware agents tackling Complex Verilog Design Problems (CVDP). It mines repeated rollout traces to extract success and failure modes, converts them into dense diagnostics and oracle lessons, and runs an oracle-mutator-selector loop that evolves task-specific skills for search, editing, validation, and recovery. Bounded runtime dense verifier feedback augments both skill evolution and agent execution. The central claim is that this approach substantially raises task pass rates on hard CVDP instances that defeat both the seed agent and frontier coding agents, including breakthrough solutions on previously unsolved tasks, all without fine-tuning data, RTL-specialized training, or model weight updates. The framework is positioned as a general strategy extensible to other verifiable EDA tasks.
Significance. If the empirical results and transferability claims hold under scrutiny, the work offers a meaningful contribution to test-time scaling for long-context agents in hardware design. It demonstrates a path to improve performance on repository-scale Verilog problems by evolving natural-language skills from traces rather than relying on model updates, which could reduce dependence on high-quality fine-tuning corpora in EDA domains. The use of dense verifier feedback to connect skill text with functional observations is a practical mechanism worth further exploration.
major comments (3)
- [Abstract] Abstract: the central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.
- [Framework description] Framework description (inferred §3): the skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.
- [Evaluation section] Evaluation section (inferred §4): the claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.
minor comments (2)
- [Method] Clarify the exact format in which evolved skills are stored and injected into the agent's prompt or policy at inference time.
- [Framework] Add a diagram or pseudocode for the oracle-mutator-selector loop to improve readability of the iterative process.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where additional clarity and evidence would strengthen the manuscript. We address each major comment below and commit to revisions that improve transparency without altering the core claims or experimental setup.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.
  Authors: We agree that the abstract should convey key quantitative outcomes to enable immediate assessment of effect sizes. In the revised version we will add concise statements of the main pass-rate gains (including the number of tasks and comparison to seed and frontier baselines) while preserving the abstract's brevity; full statistics, error bars, ablations, and dataset details will remain in the evaluation section and supplementary material.
  Revision: yes
- Referee: [Framework description] Framework description (inferred §3): the skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.
  Authors: Section 3 already specifies the oracle-mutator-selector loop, the natural-language skill format (structured diagnostic patterns plus recovery heuristics), and the selection criterion (empirical success on held-out validation rollouts). To make these elements fully explicit and to demonstrate abstraction beyond local patches, we will insert a detailed algorithm box and additional prose describing the mutation operators (e.g., generalization from concrete fixes to reusable diagnostic templates) and the abstraction mechanisms used to avoid repository-specific artifacts.
  Revision: yes
- Referee: [Evaluation section] Evaluation section (inferred §4): the claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.
  Authors: The current evaluation already compares Trace2Skill against the seed agent and frontier models on the same hard CVDP tasks. We acknowledge that an explicit ablation separating the full oracle-mutator-selector loop from simple trace replay would further isolate the contribution of skill evolution. We will add this ablation in the revised manuscript and will also report cross-task transfer results to address concerns about diagnostic brittleness. These additions will be presented as new experiments rather than reinterpretation of existing data.
  Revision: partial
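The selection criterion named in the second response (empirical success on held-out validation rollouts) can be combined with the score quoted elsewhere on this page, SelectQ(S) = PassRate(S) + εU(S). The sketch below is one hedged reading of that formula: the page does not define U, so treating it as a mean dense-progress score in [0, 1], and treating ε as a small tie-breaking weight, are assumptions, as are the function names and the candidate data.

```python
from typing import Dict, List, Tuple


def select_q(passes: List[bool], dense_scores: List[float], eps: float = 0.01) -> float:
    """SelectQ(S) = PassRate(S) + eps * U(S), with U assumed to be a mean dense score."""
    pass_rate = sum(passes) / len(passes)
    utility = sum(dense_scores) / len(dense_scores)
    return pass_rate + eps * utility


def select_skill(candidates: Dict[str, Tuple[List[bool], List[float]]], eps: float = 0.01) -> str:
    """Pick the candidate skill with the highest SelectQ on validation rollouts."""
    return max(candidates, key=lambda name: select_q(*candidates[name], eps=eps))


# Placeholder validation outcomes: (pass/fail per rollout, dense score per rollout).
all_failing = {
    "skill_a": ([False, False], [0.2, 0.4]),
    "skill_b": ([False, False], [0.6, 0.8]),
}
winner_among_failures = select_skill(all_failing)

with_one_pass = dict(all_failing, skill_c=([True, False], [0.1, 0.1]))
winner_overall = select_skill(with_one_pass)
```

With a small ε, the dense term only separates candidates whose pass rates tie, which is where coarse pass/fail labels give the selector nothing to work with. Any pass still outranks dense progress alone.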
Circularity Check
No circularity: procedural framework with no equations or self-referential reductions
full rationale
The paper presents Trace2Skill as a test-time scaling procedure that mines rollout traces, converts modes into diagnostics via an oracle-mutator-selector loop, and applies dense verifier feedback to evolve skills for CVDP tasks. No mathematical derivations, equations, or parameter-fitting steps appear in the abstract or description. Claims of improved pass rates rest on empirical application to hard tasks rather than any chain that reduces predictions to inputs by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The method is self-contained as an algorithmic recipe whose validity is intended to be assessed externally through task performance, not internal definitional closure.
Axiom & Free-Parameter Ledger
free parameters (1)
- skill mutation and selection parameters
axioms (1)
- domain assumption: Bounded runtime dense verifier feedback can be sanitized to provide functional observations while keeping hidden harnesses inaccessible.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Trace2Skill treats the agent’s natural-language skill as an evolvable policy. It mines repeated rollout traces for both success modes and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle–mutator–selector loop
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  $\mathrm{SelectQ}(S) = \mathrm{PassRate}(S) + \epsilon\,U(S)$ with dense metrics SkillQ, AgentProgressQ
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023.
- [2] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023.
- [3] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. Transactions on Machine Learning Research, 2024.
- [4] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems, 2024.
- [5] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In Conference on Language Modeling, 2024.
- [6] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. MetaGPT: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, 2024.
- [7] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kai Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents. In International Conference on Learning Representations, 2024.
- [8] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, 2024.
- [9] Mingjie Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. Verilogeval: Evaluating large language models for verilog code generation. In IEEE/ACM International Conference on Computer-Aided Design, 2023.
- [10] Yao Lu, Shang Liu, Qijun Zhang, and Zhiyao Xie. RTLLM: An open-source benchmark for design RTL generation with large language model. In Asia and South Pacific Design Automation Conference, 2024.
- [11] Christopher Batten, Nathaniel Pinckney, Mingjie Liu, Haoxing Ren, and Brucek Khailany. PyHDL-Eval: An LLM evaluation framework for hardware design using python-embedded DSLs. In ACM/IEEE Symposium on Machine Learning for CAD, 2024.
- [12] Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, and Haoxing Ren. Comprehensive verilog design problems: A next-generation benchmark dataset for evaluating large language models and agents on rtl design and verification, 2025.
- [13] Chenhui Deng, Yunsheng Bai, and Haoxing Ren. ChipAlign: Instruction alignment in large language models for chip design via geodesic interpolation. In ACM/IEEE Design Automation Conference, 2025.
- [14] Chenhui Deng, Yun-Da Tsai, Guan-Ting Liu, Zhongzhi Yu, and Haoxing Ren. ScaleRTL: Scaling LLMs with reasoning data and test-time compute for accurate RTL code generation. In ACM/IEEE Symposium on Machine Learning for CAD, 2025.
- [15] Yun-Da Tsai, Mingjie Liu, and Haoxing Ren. RTLFixer: Automatically fixing RTL syntax errors with large language models. In ACM/IEEE Design Automation Conference, 2024.
- [16] Chia-Tung Ho, Haoxing Ren, and Brucek Khailany. VerilogCoder: Autonomous verilog coding agents with graph-based planning and abstract syntax tree-based waveform tracing tool. In AAAI Conference on Artificial Intelligence, 2025.
- [17] Chenhui Deng, Zhongzhi Yu, Guan-Ting Liu, Nathaniel Pinckney, Brucek Khailany, and Haoxing Ren. ACE-RTL: When agentic context evolution meets RTL-specialized LLMs, 2026.
Zijian Du and Nathaniel Pinckney
A ARTIFACT APPENDIX
This appendix gives compact, trace-grounded artifacts used by the Trace2Skill pipeline. Full raw traces remain in the experiment direc...