pith. machine review for the scientific record.

arxiv: 2604.03245 · v1 · submitted 2026-03-06 · 💻 cs.AR · cs.AI · cs.SE

Recognition: no theorem link

FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:27 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.SE

keywords formal verification · SystemVerilog assertions · rule learning · operator reasoning tree · NL-to-SVA translation · large language models

The pith

FVRuleLearner learns rules from an operator reasoning tree to raise functional correctness of SystemVerilog assertions generated from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FVRuleLearner, which builds an Operator Reasoning Tree to break the translation of natural-language specifications into SystemVerilog Assertions down into fine-grained, operator-specific reasoning steps. Successful reasoning paths are stored during training, then retrieved at test time to create rules for new inputs. This structured approach targets the main weaknesses of large language models in formal verification: poor operator selection and limited training data. If effective, it reduces the manual effort of writing the assertions that hardware designers rely on to prove correctness. The reported gains show fewer syntax and functional errors across multiple operator categories.

Core claim

FVRuleLearner models NL-to-SVA generation as structured reasoning over an OP-Tree that decomposes the task into operator-aware questions and combines paths leading to correct assertions. In the training phase the tree is built from successful examples; in testing it retrieves matching operator traces to produce rules for unseen specifications. This yields an average 3.95 percent gain in syntax correctness and 31.17 percent gain in functional correctness over the prior baseline, while cutting functional failures by 70.33 percent on average across operator categories.
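The relationship between correctness rates and a quoted failure-reduction figure is simple arithmetic; the sketch below applies it to the O3-MINI FUNC numbers visible in the Table III excerpt further down (0.4833 for the FVEVAL-3-SHOT baseline vs 0.7833 for FVRuleLearner), not to the paper's cross-dataset averages.

```python
# Illustrative arithmetic only; the inputs are the single Table III row quoted
# on this page, not the paper's averaged headline numbers.

def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100.0

def failure_reduction(baseline_correct: float, new_correct: float) -> float:
    """Percentage of the baseline's failures eliminated by the new method."""
    base_fail = 1.0 - baseline_correct
    new_fail = 1.0 - new_correct
    return (base_fail - new_fail) / base_fail * 100.0

# FUNC column, O3-MINI on NL2SVA-MACHINE: 0.4833 baseline vs 0.7833.
print(round(relative_gain(0.4833, 0.7833), 2))      # → 62.07 (% relative gain)
print(round(failure_reduction(0.4833, 0.7833), 2))  # → 58.06 (% of failures removed)
```

The two views differ: a 62% relative gain in correctness corresponds to removing 58% of the failures on this row, which is how a 31.17% correctness gain and a 70.33% failure reduction can coexist as averages.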

What carries the argument

The Operator Reasoning Tree (OP-Tree), which decomposes NL-to-SVA alignment into fine-grained operator-aware questions and stores complete reasoning paths that produced correct assertions for later retrieval.
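The extract does not publish the tree's layout or retrieval algorithm; the following is a minimal hypothetical sketch of an operator-keyed store of successful reasoning traces, with every class and method name invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch in the spirit of the OP-Tree described above.
# Node and method names are illustrative, not the paper's.

@dataclass
class OpNode:
    operator: str            # e.g. "|->", "##1", "$past"
    question: str            # the operator-aware question posed at this node
    children: list["OpNode"] = field(default_factory=list)
    successful_paths: list[list[str]] = field(default_factory=list)

class OpTree:
    def __init__(self) -> None:
        self.roots: dict[str, OpNode] = {}

    def add_success(self, operators: list[str], trace: list[str]) -> None:
        """Training: retain a reasoning trace only if it produced a correct SVA."""
        key = operators[0]
        node = self.roots.setdefault(key, OpNode(key, f"How should '{key}' be applied?"))
        node.successful_paths.append(trace)

    def retrieve(self, operators: list[str]) -> list[list[str]]:
        """Testing: fetch stored traces whose root operator matches the new spec."""
        return [t for op in operators if op in self.roots
                for t in self.roots[op].successful_paths]
```

At test time the retrieved traces would seed Op-Rule generation for the unseen specification.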

Load-bearing premise

The learned tree can retrieve reasoning paths that remain correct for entirely new natural-language specifications without creating fresh functional errors.

What would settle it

Run the method on a fresh set of natural-language hardware specifications never seen in training and measure whether the fraction of functional failures drops below 50 percent or syntax correctness falls below the baseline.
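That settling test reduces to two threshold checks over a held-out set, sketched below; the `results` list of (syntax_ok, functional_ok) pairs is synthetic.

```python
# Sketch of the settling criterion above: evaluate on held-out specs,
# then test the two thresholds. `results` is a hypothetical outcome list.

def settles_claim(results, baseline_syntax: float) -> dict:
    n = len(results)
    syntax_rate = sum(s for s, _ in results) / n
    functional_failure = sum(1 - f for _, f in results) / n
    return {
        "functional_failures_below_50pct": functional_failure < 0.50,
        "syntax_at_or_above_baseline": syntax_rate >= baseline_syntax,
    }

# e.g. 40 held-out specs: 38 syntactically valid, 31 also functionally correct
results = [(1, 1)] * 31 + [(1, 0)] * 7 + [(0, 0)] * 2
print(settles_claim(results, baseline_syntax=0.95))
```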

Figures

Figures reproduced from arXiv: 2604.03245 by Chia-Tung Ho, Cunxi Yu, Deming Chen, Haoxing Ren, Lily Jiaxin Wan, Yunsheng Bai.

Figure 1
Figure 1. An illustrative example of these limitations in NL-to-SVA translation. view at source ↗
Figure 2
Figure 2. Operator-level functional mismatch analysis between generated and golden SVAs across three datasets. Temporal operators dominate the errors, highlighting the need for operator-aware reasoning. view at source ↗
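Figure 2's six groups suggest a simple token-to-category mapping. The sketch below follows standard SVA operator usage; the token assignments are a plausible reading of the taxonomy, not the paper's exact one.

```python
# Rough token → category map for the six operator groups named in Figure 2.
# Assignments follow standard SVA usage; this is a guess at the paper's
# taxonomy, not a reproduction of it.

SVA_OPERATOR_GROUPS = {
    "temporal implication": ["|->", "|=>"],
    "temporal delay":       ["##1", "##[1:3]", "##[*]"],
    "temporal sampling":    ["$past", "$rose", "$fell", "$stable", "$changed"],
    "temporal liveness":    ["s_eventually", "s_until", "until", "eventually"],
    "combinational logic":  ["&&", "||", "iff", "implies"],
}

def categorize(sva_text: str) -> set[str]:
    """Operator groups whose tokens appear in an assertion string; anything
    unmatched falls into the 'miscellaneous structural' group."""
    groups = {g for g, ops in SVA_OPERATOR_GROUPS.items()
              if any(op in sva_text for op in ops)}
    return groups or {"miscellaneous structural"}

print(sorted(categorize("req |-> ##1 $past(gnt)")))
# → ['temporal delay', 'temporal implication', 'temporal sampling']
```

Real SVA handling would need tokenization rather than substring tests; this only makes the taxonomy concrete.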
Figure 3
Figure 3. Comparison between General Rules and Op-Rules. view at source ↗
Figure 4
Figure 4. The flow overview of FVRULELEARNER and an illustration of OP-TREE in training. (a) Training phase: the framework leverages OP-TREE to consolidate operator-level reasoning traces, which are stored in T∗. (b) Testing phase: FVRULELEARNER first generates an initial SVA (ŷ_init) and retrieves the relevant OP-TREE from T∗ according to the similarity score between the natural-language specification of the test data and … view at source ↗
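The similarity scoring in the testing phase is not specified in this extract; as a stand-in, a bag-of-words cosine over specification text illustrates the retrieval step (an embedding model is more likely in practice).

```python
import math
from collections import Counter

# Minimal stand-in for the similarity score in Figure 4's testing phase:
# bag-of-words cosine between the test spec and each stored one.
# Hypothetical; the paper's actual scorer is not given in this extract.

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def retrieve_best(test_spec: str, stored_specs: list[str]) -> str:
    """Return the stored specification most similar to the test one."""
    return max(stored_specs, key=lambda s: cosine(test_spec, s))

stored = ["grant must rise one cycle after request",
          "data remains stable while enable is low"]
print(retrieve_best("the grant signal rises one cycle after a request", stored))
# → grant must rise one cycle after request
```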
Figure 5
Figure 5. Figure 5: Comparison of fixing ratio trends over training iterations for FVRULELEARNER (circle) versus General Rule Learning (square) methods. Across (a) NL2SVA-HUMAN, (b) NL2SVA-MACHINE, and (c) NL2SVA-OPENCORE, the proposed OP-TREE rules learning methodology consistently accelerates convergence and achieves higher final accuracy, improving the fixing rate by an average of 23.33%. Since each baseline, as well as ou… view at source ↗
Figure 6
Figure 6. Effect of Training Data Ratio on FVRULELEARNER Performance on the NL2SVA-MACHINE dataset. view at source ↗

Table III. Rule Retrieval Ablation Study Results for O3-MINI on the NL2SVA-MACHINE dataset:

  Method           SYN     FUNC    R.FUNC
  FVEVAL-3-SHOT    0.9500  0.4833  0.5833
  General Rules    0.9500  0.6667  0.7500
  Static Op-Rules  0.9500  0.7167  0.7833
  FVRULELEARNER    0.9667  0.7833  0.8333
read the original abstract

The remarkable reasoning and code generation capabilities of large language models (LLMs) have recently motivated increasing interest in automating formal verification (FV), a process that ensures hardware correctness through mathematically precise assertions but remains highly labor-intensive, particularly through the translation of natural language into SystemVerilog Assertions (NL-to-SVA). However, LLMs still struggle with SVA generation due to limited training data and the intrinsic complexity of FV operators. Consequently, a more efficient and robust methodology for ensuring correct SVA operator selection is essential for producing functionally correct assertions. To address these challenges, we introduce FVRuleLearner, an Operator-Level Rule (Op-Rule) learning framework built on a novel Operator Reasoning Tree (OP-Tree), which models SVA generation as structured, interpretable reasoning. FVRuleLearner operates in two complementary phases: (1) Training: it constructs OP-Tree that decomposes NL-to-SVA alignment into fine-grained, operator-aware questions, combining reasoning paths that lead to correct assertions; and (2) Testing: it performs operator-aligned retrieval to fetch relevant reasoning traces from the learned OP-Tree and generate new rules for unseen specifications. In the comprehensive studies, the proposed FVRuleLearner outperforms the state-of-the-art baseline by 3.95% in syntax correctness and by 31.17% in functional correctness on average. Moreover, FVRuleLearner successfully reduces an average of 70.33% of SVA functional failures across diverse operator categories through a functional taxonomy analysis, showing the effectiveness of applying learned OP-Tree to the Op-Rule generations for unseen NL-to-SVA tasks. These results establish FVRuleLearner as a new paradigm for domain-specific reasoning and rule learning in formal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FVRuleLearner, a two-phase framework for NL-to-SVA translation in formal verification. In training, it builds an Operator Reasoning Tree (OP-Tree) by decomposing specifications into operator-aware questions and retaining only reasoning paths that produce correct assertions. In testing, it performs operator-aligned retrieval from the learned tree to generate rules for unseen natural-language specifications. The central claims are average improvements of 3.95% in syntax correctness and 31.17% in functional correctness over a state-of-the-art baseline, plus a 70.33% average reduction in SVA functional failures across operator categories.

Significance. If the reported gains are reproducible and the generalization holds, the work would represent a meaningful step toward more reliable automated assertion generation by replacing opaque LLM prompting with an interpretable, operator-structured retrieval mechanism. The OP-Tree construction and functional-taxonomy analysis are concrete contributions that could be extended to other verification domains.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Setup): the headline performance numbers (3.95% syntax, 31.17% functional correctness, 70.33% failure reduction) are presented without dataset cardinality, number of test specifications, baseline implementation details, or any statistical significance tests. These omissions make it impossible to judge whether the observed deltas are reliable or could be explained by sampling variance.
  2. [§3.2 and §5] §3.2 (Testing Phase) and §5 (Results): the claim that operator-aligned retrieval produces functionally correct rules for completely unseen specifications rests on the untested assumption that surface-level operator matching preserves semantic equivalence. No out-of-distribution split by operator novelty, no ablation that removes specific operators from training, and no independent model-checking verification of generated assertions (beyond the reported aggregate metrics) are provided.
  3. [§3.1] §3.1 (OP-Tree Construction): the training procedure selects only reasoning paths that yield correct assertions, but the manuscript supplies no quantitative characterization of how many paths are discarded, how operator coverage in the retained tree compares to the test distribution, or whether the tree can represent operator combinations absent from training.
minor comments (2)
  1. [§3] Notation for the OP-Tree nodes and retrieval scoring function is introduced without a compact mathematical definition or pseudocode, making the exact retrieval algorithm difficult to re-implement.
  2. [§5] Figure captions and axis labels in the functional-taxonomy plots should explicitly state the number of specifications per operator category.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify key areas for strengthening the experimental rigor and clarity of our work. We address each major comment point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): the headline performance numbers (3.95% syntax, 31.17% functional correctness, 70.33% failure reduction) are presented without dataset cardinality, number of test specifications, baseline implementation details, or any statistical significance tests. These omissions make it impossible to judge whether the observed deltas are reliable or could be explained by sampling variance.

    Authors: We agree that these details are necessary to assess result reliability. In the revised manuscript we will expand the abstract and §4 to report the exact dataset cardinality (training and test specification counts), full baseline implementation details (including prompting strategies and model versions), and statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) to confirm the improvements exceed sampling variance. revision: yes

  2. Referee: [§3.2 and §5] §3.2 (Testing Phase) and §5 (Results): the claim that operator-aligned retrieval produces functionally correct rules for completely unseen specifications rests on the untested assumption that surface-level operator matching preserves semantic equivalence. No out-of-distribution split by operator novelty, no ablation that removes specific operators from training, and no independent model-checking verification of generated assertions (beyond the reported aggregate metrics) are provided.

    Authors: Operator alignment follows the functional taxonomy in which each operator defines a distinct semantic category; surface matching therefore corresponds to category-level equivalence by construction. To strengthen the claim we will add an OOD split by operator novelty and an ablation that removes specific operators from training. We will also clarify that functional correctness is obtained via independent model checking of each generated assertion against its specification (using standard FV tools), not solely aggregate metrics. revision: partial

  3. Referee: [§3.1] §3.1 (OP-Tree Construction): the training procedure selects only reasoning paths that yield correct assertions, but the manuscript supplies no quantitative characterization of how many paths are discarded, how operator coverage in the retained tree compares to the test distribution, or whether the tree can represent operator combinations absent from training.

    Authors: We will revise §3.1 to include quantitative characterization of OP-Tree construction: the fraction of reasoning paths retained after correctness filtering, operator-coverage statistics comparing the retained tree to the test distribution, and an analysis showing how the hierarchical structure encodes unseen operator combinations via shared sub-paths. revision: yes
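The bootstrap confidence intervals proposed in the first response could take the following paired-bootstrap form; the 0/1 outcome vectors are synthetic illustrations, not the paper's data.

```python
import random

# Paired bootstrap on per-specification correctness, as the rebuttal proposes.
# The outcome vectors below are synthetic; only the procedure is the point.

def bootstrap_delta_ci(baseline, method, iters=5000, alpha=0.05, seed=0):
    """CI for mean(method) - mean(baseline), resampling paired outcomes."""
    rng = random.Random(seed)
    n = len(baseline)
    deltas = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(method[i] - baseline[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * iters)]
    hi = deltas[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

baseline = [1] * 29 + [0] * 31   # 29/60 functionally correct (synthetic)
method   = [1] * 47 + [0] * 13   # 47/60 functionally correct (synthetic)
lo, hi = bootstrap_delta_ci(baseline, method)
print(lo > 0)  # the improvement is significant if the whole CI sits above zero
```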

Circularity Check

0 steps flagged

No significant circularity; empirical retrieval on unseen specs is self-contained

full rationale

The paper constructs an OP-Tree during training by decomposing NL-to-SVA into operator-aware paths that produce correct assertions, then uses operator-aligned retrieval at test time to generate rules for unseen specifications. Reported gains (3.95% syntax correctness, 31.17% functional correctness, 70.33% failure reduction) are measured via direct comparison against baselines on held-out data, not by any equation or definition that reduces the output to a fitted input or self-citation. No mathematical derivations, self-definitional loops, or load-bearing prior-author uniqueness theorems appear in the text. The method therefore remains externally falsifiable on new NL specs and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework assumes that decomposing NL-to-SVA tasks into operator-aware questions yields reusable reasoning paths that generalize to unseen specifications. No explicit free parameters or invented physical entities are stated.

axioms (1)
  • domain assumption LLMs benefit from structured operator-aware reasoning paths when generating SVA
    Invoked in the training phase that constructs the OP-Tree from correct assertion examples.
invented entities (1)
  • Operator Reasoning Tree (OP-Tree) no independent evidence
    purpose: Models SVA generation as structured, interpretable reasoning decomposed by operators
    New structure introduced to organize reasoning paths; no external falsifiable evidence provided in abstract.

pith-pipeline@v0.9.0 · 5641 in / 1256 out tokens · 33596 ms · 2026-05-15T14:27:56.401190+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 3 internal anchors

  1. [1] H. Foster, "Part 8: The 2020 Wilson Research Group functional verification study," Siemens EDA, 2021. Accessed: 2024-07-25. [Online]. Available: https://blogs.sw.siemens.com/verificationhorizons/2021/01/06/part-8-the-2020-wilson-research-group-functional-verification-study/
  2. [2] O. Hasan and S. Tahar, "Formal verification methods," in Encyclopedia of Information Science and Technology, Third Edition. IGI Global, 2015, pp. 7162–7170.
  3. [3] E. Seligman, T. Schubert, and M. A. K. Kumar, Formal Verification: An Essential Toolkit for Modern VLSI Design. Elsevier, 2023.
  4. [4] S. Vijayaraghavan and M. Ramanathan, A Practical Guide for SystemVerilog Assertions. Springer Science & Business Media, 2005.
  5. [5] E. Cerny, S. Dudani, J. Havlicek, D. Korchemny et al., SVA: The Power of Assertions in SystemVerilog. Springer, 2015.
  6. [6] C. E. Cummings, "SystemVerilog assertions design tricks and SVA bind files," in SNUG, 2009, pp. 1–42.
  7. [7] H. Foster, K. Larsen, and M. Turpin, "Introduction to the new Accellera Open Verification Library," in DVCon'06: Proceedings of the Design and Verification Conference and Exhibition. Citeseer, 2006.
  8. [8] M. Liu et al., "ChipNeMo: Domain-adapted LLMs for chip design," arXiv:2311.00176, 2023.
  9. [9] R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, "LLM-assisted generation of hardware assertions," arXiv preprint arXiv:2306.14027, 2023.
  10. [10] M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, "Using LLMs to facilitate formal verification of RTL," arXiv e-prints, arXiv–2309, 2023.
  11. [11] D. Nam, A. Macvean, V. Hellendoorn, B. Vasilescu, and B. Myers, "Using an LLM to help with code understanding," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13.
  12. [12] F. Lin, D. J. Kim et al., "When LLM-based code generation meets the software development process," arXiv preprint arXiv:2403.15852, 2024.
  13. [13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
  14. [14] Z. Wang, Z. Zhou, D. Song, Y. Huang, S. Chen, L. Ma, and T. Zhang, "Where do large language models fail when generating code?" arXiv preprint arXiv:2406.08731, 2024.
  15. [15] W. Fang, M. Li, M. Li, Z. Yan, S. Liu, H. Zhang, and Z. Xie, "AssertLLM: Generating and evaluating hardware verification assertions from design specifications via multi-LLMs," arXiv preprint arXiv:2402.00386, 2024.
  16. [16] OpenCores, "OpenCores: Open source hardware designs," 2024. Accessed: June 28, 2024. [Online]. Available: https://opencores.org/
  17. [17] N. Bjørner, A. Browne, and Z. Manna, "Automatic generation of invariants and intermediate assertions," Theoretical Computer Science, vol. 173, no. 1, pp. 49–87, 1997.
  18. [18] A. Hekmatpour and A. Salehi, "Block-based schema-driven assertion generation for functional verification," in 14th Asian Test Symposium (ATS'05). IEEE, 2005, pp. 34–39.
  19. [19] D. Yu, H. Foster, and T. Fitzpatrick, "A survey of machine learning applications in functional verification," DVCon US, 2023.
  20. [20] M. J. Kang and H. Ren, "FVEval: Evaluating LLMs on hardware formal verification," https://github.com/NVlabs/FVEval, 2024.
  21. [21] V. Pulavarthi, D. Nandal, S. Dan, and D. Pal, "AssertionBench: A benchmark to evaluate large-language models for assertion generation," arXiv preprint arXiv:2406.18627, 2024.
  22. [22] C. Sun, C. Hahn, and C. Trippel, "Towards improving verification productivity with circuit-aware translation of natural language to SystemVerilog assertions," in First International Workshop on Deep Learning-aided Verification, 2023.
  23. [23] B. Mali, K. Maddala, S. Reddy, V. Gupta, C. Karfa, and R. Karri, "ChIRAAG: ChatGPT informed rapid and automated assertion generation," arXiv preprint arXiv:2402.00093, 2024.
  24. [24] M. Hassan, S. Ahmadi-Pour, K. Qayyum, C. K. Jha, and R. Drechsler, "LLM-guided formal verification coupled with mutation testing," in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–2.
  25. [25] K. Maddala, B. Mali, and C. Karfa, "LAAG-RV: LLM assisted assertion generation for RTL design verification," in 2024 IEEE 8th International Test Conference India (ITC India). IEEE, 2024, pp. 1–6.
  26. [27] [Online]. Available: https://arxiv.org/abs/2506.21569
  27. [28] J. Cen, J. Liu, Z. Li, and J. Wang, "SQLFixAgent: Towards semantic-accurate SQL generation via multi-agent collaboration," arXiv preprint arXiv:2406.13408, 2024.
  28. [29] M. Endres, S. Fakhoury, S. Chakraborty, and S. K. Lahiri, "Can large language models transform natural language intent into formal method postconditions?" Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1889–1912, 2024.
  29. [30] F. He, J. Zhai, and M. Pan, "Beyond code generation: Assessing code LLM maturity with postconditions," arXiv preprint arXiv:2407.14118, 2024.
  30. [31] N. Mündler, M. N. Müller, J. He, and M. Vechev, "Code agents are state of the art software testers," arXiv preprint arXiv:2406.12952, 2024.
  31. [32] J. Song, J. Raiman, and B. Catanzaro, "Effective large language model debugging with best-first tree search," arXiv preprint arXiv:2407.19055, 2024.
  32. [33] X. Yin, C. Ni, S. Wang, Z. Li, L. Zeng, and X. Yang, "ThinkRepair: Self-directed automated program repair," arXiv preprint arXiv:2407.20898, 2024.
  33. [34] C.-T. Ho and H. Ren, "Large language model (LLM) for standard cell layout design optimization," arXiv preprint arXiv:2406.06549, 2024.
  34. [35] Anthropic PBC, "Introducing Claude Sonnet 4.5," https://www.anthropic.com/news/claude-sonnet-4-5, Sep. 2025. Accessed: 2025-11-06.
  35. [37]
  36. [38] Y. Tsai et al., "RTLFixer: Automatically fixing RTL syntax errors with large language models," arXiv:2311.16543, 2023.
  37. [39] C.-T. Ho et al., "VerilogCoder: Autonomous Verilog coding agents with graph-based planning and AST-based waveform tracing tool," in AAAI.
  38. [40] [Online]. Available: https://doi.org/10.1609/aaai.v39i1.32007
  39. [41] J. Wan, C.-T. Ho, R. Liang, C. Yu, D. Chen, and H. Ren, "SchemaCoder: Automatic log schema extraction coder with residual Q-tree boosting," arXiv preprint arXiv:2508.18554, 2025. [Online]. Available: https://arxiv.org/abs/2508.18554
  40. [42] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," in MLCAD, 2023.
  41. [43] X. Liang, S. Song, Z. Zheng, H. Wang, Q. Yu, X. Li, R.-H. Li, F. Xiong, and Z. Li, "Internal consistency and self-feedback in large language models: A survey," arXiv preprint arXiv:2407.14507, 2024.
  42. [44] L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang, "Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies," arXiv preprint arXiv:2308.03188, 2023.
  43. [45] R. Kamoi, Y. Zhang, N. Zhang, J. Han, and R. Zhang, "When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs," arXiv preprint arXiv:2406.01297, 2024.
  44. [46] N. Jiang, X. Li, S. Wang, Q. Zhou, S. B. Hossain, B. Ray, V. Kumar, X. Ma, and A. Deoras, "Training LLMs to better self-debug and explain code," arXiv preprint arXiv:2405.18649, 2024.
  45. [47] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., "Self-Refine: Iterative refinement with self-feedback," Advances in Neural Information Processing Systems, vol. 36, 2023.
  46. [48] H. Huang, Z. Lin, Z. Wang, X. Chen, K. Ding, and J. Zhao, "Towards LLM-powered Verilog RTL assistant: Self-verification and self-correction," arXiv preprint arXiv:2406.00115, 2024.
  47. [49] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
  48. [50] Y.-C. Liao, C.-J. Yu, C.-Y. Lin, H.-F. Yun, Y.-H. Wang, H.-M. Li, and Y.-C. Fan, "Learning-from-mistakes prompting for indigenous language translation," arXiv preprint arXiv:2407.13343, 2024.
  49. [51] Y. Qu, T. Zhang, N. Garg, and A. Kumar, "Recursive introspection: Teaching language model agents how to self-improve," arXiv preprint arXiv:2407.18219, 2024.
  50. [52] Y. Tong, D. Li, S. Wang, Y. Wang, F. Teng, and J. Shang, "Can LLMs learn from previous mistakes? Investigating LLMs' errors to boost for reasoning," arXiv preprint arXiv:2403.20046, 2024.
  51. [53] T. Zhang, A. Madaan, L. Gao, S. Zheng, S. Mishra, Y. Yang, N. Tandon, and U. Alon, "In-context principle learning from mistakes," arXiv preprint arXiv:2402.05403, 2024.
  52. [54] H. Sun, Y. Jiang, B. Wang, Y. Hou, Y. Zhang, P. Xie, and F. Huang, "Retrieved in-context principles from previous mistakes," arXiv preprint arXiv:2407.05682, 2024.
  53. [55] X. Liang, S. Niu, S. Zhang, S. Song, H. Wang, J. Yang, F. Xiong, B. Tang, C. Xi et al., "Empowering large language models to set up a knowledge retrieval indexer via self-learning," arXiv preprint arXiv:2405.16933, 2024.
  54. [56] X. Yang, H. Chen, W. Feng, H. Wang, Z. Ye, X. Shen, X. Yang, S. Sun, W. Liu, and J. Bian, "Collaborative evolving strategy for automatic data-centric development," arXiv preprint arXiv:2407.18690, 2024.
  55. [57] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, "Large language models as optimizers," 2024.
  56. [58] S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu, "Offline training of language model agents with functions as learnable weights," in Forty-first International Conference on Machine Learning, 2024.
  57. [59] W. Zhou, Y. Ou, S. Ding, L. Li, J. Wu, T. Wang, J. Chen, S. Wang, X. Xu, N. Zhang et al., "Symbolic learning enables self-evolving agents," arXiv preprint arXiv:2406.18532, 2024.
  58. [60] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou, "TextGrad: Automatic 'differentiation' via text," arXiv preprint arXiv:2406.07496, 2024.
  59. [61] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
  60. [62] Cadence Design Systems, "Cadence JasperGold formal verification platform," https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform.html, 2024. Accessed: 2024-07-31.
  61. [63] J. Niimi, "A simple ensemble strategy for LLM inference: Towards more stable text classification," in International Conference on Applications of Natural Language to Information Systems. Springer, 2025, pp. 189–199.
  62. [64] OpenAI et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
  63. [65] OpenAI, "OpenAI o3-mini," https://openai.com/index/openai-o3-mini/, Jan. 2025. Accessed: 2025-11-06.