arxiv: 2605.21810 · v1 · pith:TVVUULYSnew · submitted 2026-05-20 · 💻 cs.AI · cs.MA

Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Pith reviewed 2026-05-22 08:35 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords Trace2Skill · CVDP · hardware agents · skill evolution · verifier feedback · test-time scaling · Verilog design · EDA

The pith

Trace2Skill evolves an agent's natural-language skills from rollout traces using verifier feedback to solve complex Verilog design problems without model updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Trace2Skill, a framework that treats an LLM agent's skills as an evolvable policy for tackling Complex Verilog Design Problems. It analyzes repeated rollout traces to identify success and failure modes, converts these into dense diagnostics and oracle lessons, and employs an oracle-mutator-selector loop to generate task-specific skills. These skills then guide the agent's search, editing, validation, and recovery processes. This approach connects skill text directly to verifier evidence and downstream behavior, and the paper reports improved pass rates on hard tasks that defeat the seed agent, including some that also defeat frontier coding agents. It achieves these gains through test-time scaling alone, without fine-tuning data, specialized training, or changes to model weights, and the authors suggest it can extend to other verifiable EDA tasks.

Core claim

Trace2Skill is a test-time scaling framework that improves hardware LLM agents on Complex Verilog Design Problems. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later actions. Bounded runtime dense verifier feedback supports the loop by returning sanitized functional observations.

What carries the argument

The oracle-mutator-selector loop that evolves task-specific skills from traces and dense verifier feedback, serving as the mechanism to connect skill text to evidence and behavior.
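The loop described above can be pictured with a minimal sketch. This is not the authors' implementation: the `Rollout` record, the greedy "keep strict improvements" selection rule, the parameter names, and the toy stand-ins for the agent, oracle, and mutator are all assumptions made for illustration, since the source text does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    """One agent attempt under a given skill (hypothetical record format)."""
    skill: str
    passed: bool       # final verifier verdict
    diagnostics: str   # dense verifier feedback, assumed already sanitized


def evolve_skill(
    seed_skill: str,
    run_agent: Callable[[str], Rollout],
    oracle: Callable[[List[Rollout]], str],
    mutator: Callable[[str, str], List[str]],
    rollouts_per_skill: int = 4,
    generations: int = 3,
) -> str:
    """Return the best skill text found by a greedy oracle-mutator-selector loop."""

    def evaluate(skill: str) -> Tuple[float, List[Rollout]]:
        traces = [run_agent(skill) for _ in range(rollouts_per_skill)]
        return sum(t.passed for t in traces) / len(traces), traces

    best_skill = seed_skill
    best_rate, best_traces = evaluate(best_skill)
    for _ in range(generations):
        lesson = oracle(best_traces)                   # oracle: traces -> lesson text
        for candidate in mutator(best_skill, lesson):  # mutator: propose revised skills
            rate, traces = evaluate(candidate)
            if rate > best_rate:                       # selector: keep strict improvements
                best_skill, best_rate, best_traces = candidate, rate, traces
    return best_skill


# Deterministic toy stand-ins so the sketch runs without an LLM or a simulator.
def toy_agent(skill: str) -> Rollout:
    ok = "check include paths" in skill
    return Rollout(skill, ok, "" if ok else "compile error: missing include")


def toy_oracle(traces: List[Rollout]) -> str:
    return "check include paths" if any(not t.passed for t in traces) else ""


def toy_mutator(skill: str, lesson: str) -> List[str]:
    return [f"{skill}\n- {lesson}"] if lesson else []


evolved = evolve_skill("- read the spec", toy_agent, toy_oracle, toy_mutator)
print(evolved)
```

In this toy run the seed skill fails every rollout, the oracle turns the failure into a one-line lesson, and the selector adopts the mutated skill because its pass rate is higher. A real system would replace the three stand-ins with model calls and a verifier.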

Load-bearing premise

Repeated rollout traces contain extractable success and failure modes that can be reliably converted into dense diagnostics and oracle lessons capable of guiding effective skill evolution.

What would settle it

Running the Trace2Skill process on the same set of hard CVDP tasks and observing no increase in pass rates or no new solutions on previously unsolved tasks compared to the base agent.
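A hedged sketch of that comparison follows. The task identifiers and outcomes are illustrative placeholders, not results from the paper, and the function names are hypothetical; the code only shows how per-task pass rates and newly solved tasks could be tabulated from repeated rollouts.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of rollouts that passed the verifier (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def compare(base: Dict[str, List[bool]],
            evolved: Dict[str, List[bool]]) -> Dict[str, dict]:
    """Per-task pass rates for the base agent and the skill-evolved agent."""
    report = {}
    for task, base_outcomes in base.items():
        b = pass_rate(base_outcomes)
        e = pass_rate(evolved.get(task, []))
        report[task] = {
            "base": b,
            "evolved": e,
            "delta": e - b,
            "newly_solved": b == 0.0 and e > 0.0,
        }
    return report


def claim_survives(report: Dict[str, dict]) -> bool:
    """True if at least one task improved; False is the falsifying outcome."""
    return any(row["delta"] > 0 for row in report.values())


# Placeholder outcomes for two made-up tasks, four rollouts each.
base_runs = {"task_a": [False, False, False, False],
             "task_b": [True, False, False, False]}
evolved_runs = {"task_a": [False, True, False, False],
                "task_b": [True, False, False, False]}

report = compare(base_runs, evolved_runs)
print(report)
print("claim survives:", claim_survives(report))
```

This check ignores sampling noise. With few rollouts per task, a small positive delta is weak evidence, so a real evaluation would also need repeated runs or confidence intervals.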

Figures

Figures reproduced from arXiv: 2605.21810 by Nathaniel Pinckney, Zijian Du.

Figure 1: Trace2Skill end-to-end flow. Colors group component roles as shown in the legend; solid arrows mark the main ...
Figure 2: AgentQ proxy on 96 completed OSS seed-skill base ...
Figure 3: C3 versus C4 quality dynamics on the 8 hard CVDP ...
Figure 4: Task-level C3/C4 verifier outcomes on the 8 hard ...
Figure 5: Tree view of the repo-relative Trace2Skill implementation and representative run artifacts.
Figure 6: NeMo Gym rollout sequence for one submitted run.
Original abstract

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.
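The abstract describes dense verifier feedback that is both bounded and sanitized. One way to picture such a filter is sketched below. The marker words, the path pattern, and the line and character limits are assumptions chosen for illustration; the abstract does not say how the paper's sanitizer works.

```python
import re
from typing import List

# Hypothetical rules for content that must not reach the agent.
_PATH = re.compile(r"(?:/[\w.\-]+)+")
_HIDDEN_MARKERS = ("harness", "reference", "golden")


def sanitize_feedback(raw_lines: List[str],
                      max_lines: int = 5,
                      max_chars: int = 200) -> List[str]:
    """Return a bounded, redacted view of raw verifier output."""
    kept: List[str] = []
    for line in raw_lines:
        if len(kept) >= max_lines:
            break  # bound the amount of feedback per call
        if any(marker in line.lower() for marker in _HIDDEN_MARKERS):
            continue  # drop lines that mention hidden harness or reference material
        redacted = _PATH.sub("<path>", line)  # hide filesystem locations
        kept.append(redacted[:max_chars])
    return kept


raw = [
    "FAIL: out_valid expected 1 got 0 at cycle 12",
    "loading /opt/hidden/harness/tb_top.sv",
    "mismatch logged in /tmp/run/sim.log",
    "reference output: 0xDEADBEEF",
]
feedback = sanitize_feedback(raw)
print(feedback)
```

The functional observation (signal, expected and actual value, cycle) survives, while the harness location and the reference value are withheld. Keyword filtering of this kind is easy to bypass, so it should be read as a picture of the interface rather than a security boundary.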

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Trace2Skill, a test-time scaling framework for LLM-based hardware agents tackling Complex Verilog Design Problems (CVDP). It mines repeated rollout traces to extract success and failure modes, converts them into dense diagnostics and oracle lessons, uses an oracle-mutator-selector loop to produce task-specific skills, and augments this with bounded runtime dense verifier feedback to guide skill evolution and the agent's search, editing, validation, and recovery. The central claim is that this approach substantially raises task pass rates on hard CVDP instances that defeat the seed agent, including some that also defeat frontier coding agents, with breakthrough passes on previously unsolved tasks, all without fine-tuning data, RTL-specialized training, or model weight updates. The framework is positioned as a general strategy extensible to other verifiable EDA tasks.

Significance. If the empirical results and transferability claims hold under scrutiny, the work offers a meaningful contribution to test-time scaling for long-context agents in hardware design. It demonstrates a path to improve performance on repository-scale Verilog problems by evolving natural-language skills from traces rather than relying on model updates, which could reduce dependence on high-quality fine-tuning corpora in EDA domains. The use of dense verifier feedback to connect skill text with functional observations is a practical mechanism worth further exploration.

major comments (3)
  1. [Abstract] Abstract: the central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.
  2. [Framework description] Framework description (inferred §3): the skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.
  3. [Evaluation section] Evaluation section (inferred §4): the claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.
minor comments (2)
  1. [Method] Clarify the exact format in which evolved skills are stored and injected into the agent's prompt or policy at inference time.
  2. [Framework] Add a diagram or pseudocode for the oracle-mutator-selector loop to improve readability of the iterative process.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for highlighting areas where additional clarity and evidence would strengthen the manuscript. We address each major comment below and commit to revisions that improve transparency without altering the core claims or experimental setup.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of substantial pass-rate improvements and breakthrough passes on unsolved tasks is stated without any quantitative results, error bars, ablation studies, dataset statistics, or baseline comparisons. This absence makes it impossible to assess effect sizes, statistical significance, or reproducibility from the provided text.

    Authors: We agree that the abstract should convey key quantitative outcomes to enable immediate assessment of effect sizes. In the revised version we will add concise statements of the main pass-rate gains (including the number of tasks and comparison to seed and frontier baselines) while preserving the abstract's brevity; full statistics, error bars, ablations, and dataset details will remain in the evaluation section and supplementary material. revision: yes

  2. Referee: [Framework description] Framework description (inferred §3): the skill evolution procedure assumes that repeated rollout traces contain extractable, transferable success/failure modes that the oracle-mutator-selector can reliably convert into reusable natural-language skills. No explicit mechanisms, skill representation format, mutation operators, or selection criteria are detailed to ensure abstraction beyond local, task-specific patches (e.g., include paths or build dependencies) in large Verilog repositories.

    Authors: Section 3 already specifies the oracle-mutator-selector loop, the natural-language skill format (structured diagnostic patterns plus recovery heuristics), and the selection criterion (empirical success on held-out validation rollouts). To make these elements fully explicit and to demonstrate abstraction beyond local patches, we will insert a detailed algorithm box and additional prose describing the mutation operators (e.g., generalization from concrete fixes to reusable diagnostic templates) and the abstraction mechanisms used to avoid repository-specific artifacts. revision: yes

  3. Referee: [Evaluation section] Evaluation section (inferred §4): the claim that Trace2Skill succeeds on tasks defeating frontier coding agents rests on the untested assumption that dense verifier feedback produces generalizable diagnostics rather than brittle, non-transferable guidance. Without ablations isolating the contribution of the mutation/selection loop versus simple trace replay, the load-bearing role of skill evolution cannot be verified.

    Authors: The current evaluation already compares Trace2Skill against the seed agent and frontier models on the same hard CVDP tasks. We acknowledge that an explicit ablation separating the full oracle-mutator-selector loop from simple trace replay would further isolate the contribution of skill evolution. We will add this ablation in the revised manuscript and will also report cross-task transfer results to address concerns about diagnostic brittleness. These additions will be presented as new experiments rather than reinterpretation of existing data. revision: partial

Circularity Check

0 steps flagged

No circularity: procedural framework with no equations or self-referential reductions


The paper presents Trace2Skill as a test-time scaling procedure that mines rollout traces, converts modes into diagnostics via an oracle-mutator-selector loop, and applies dense verifier feedback to evolve skills for CVDP tasks. No mathematical derivations, equations, or parameter-fitting steps appear in the abstract or description. Claims of improved pass rates rest on empirical application to hard tasks rather than any chain that reduces predictions to inputs by construction. No self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The method is self-contained as an algorithmic recipe whose validity is intended to be assessed externally through task performance, not internal definitional closure.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that sanitized verifier observations can be turned into actionable lessons without access to hidden harnesses or reference solutions.

free parameters (1)
  • skill mutation and selection parameters
    Number of mutations, selection criteria, and feedback density thresholds are not quantified in the abstract but are required for the oracle-mutator-selector loop.
axioms (1)
  • domain assumption: Bounded runtime dense verifier feedback can be sanitized to provide functional observations while keeping hidden harnesses inaccessible.
    Invoked when describing how feedback guides skill evolution without exposing reference solutions.
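
The free parameters named in the ledger could be made explicit as a small configuration object. The sketch below is hypothetical: the field names are one reading of "number of mutations, selection criteria, and feedback density thresholds", and every default is a placeholder rather than a value reported by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionConfig:
    """Hypothetical knobs for a skill-evolution run; all defaults are placeholders."""
    mutations_per_round: int = 4     # how many candidate skills the mutator proposes
    min_pass_rate_gain: float = 0.0  # selector accepts a candidate only above this gain
    max_feedback_lines: int = 20     # bound on dense verifier feedback per rollout

    def __post_init__(self) -> None:
        if self.mutations_per_round < 1:
            raise ValueError("mutations_per_round must be at least 1")
        if not 0.0 <= self.min_pass_rate_gain <= 1.0:
            raise ValueError("min_pass_rate_gain must lie in [0, 1]")
        if self.max_feedback_lines < 0:
            raise ValueError("max_feedback_lines must be non-negative")


config = EvolutionConfig(mutations_per_round=8)
print(config)
```

Writing the parameters down this way makes it visible which choices a reproduction would have to report alongside its pass rates.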


