pith. machine review for the scientific record.

arxiv: 2604.03245 · v1 · submitted 2026-03-06 · 💻 cs.AR · cs.AI · cs.SE

Recognition: no theorem link

FVRuleLearner: Operator-Level Reasoning Tree (OP-Tree)-Based Rules Learning for Formal Verification

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:27 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.SE

keywords formal verification · SystemVerilog assertions · rule learning · operator reasoning tree · NL-to-SVA translation · large language models

The pith

FVRuleLearner learns rules from an operator reasoning tree to raise functional correctness of SystemVerilog assertions generated from natural language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FVRuleLearner, which builds an Operator Reasoning Tree to break the translation of natural-language specifications into SystemVerilog Assertions down into fine-grained, operator-specific reasoning steps. Successful reasoning paths are stored during training, then retrieved at test time to create rules for new inputs. This structured approach targets the main weaknesses of large language models in formal verification: poor operator selection and limited training data. If effective, it reduces the manual effort of writing the assertions that hardware designers rely on to prove correctness. The reported gains show fewer syntax and functional errors across multiple operator categories.

Core claim

FVRuleLearner models NL-to-SVA generation as structured reasoning over an OP-Tree that decomposes the task into operator-aware questions and combines paths leading to correct assertions. In the training phase the tree is built from successful examples; in testing it retrieves matching operator traces to produce rules for unseen specifications. This yields an average 3.95 percent gain in syntax correctness and 31.17 percent gain in functional correctness over the prior baseline, while cutting functional failures by 70.33 percent on average across operator categories.
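The relationship between correctness rates and a quoted failure-reduction figure is simple arithmetic; the sketch below applies it to the O3-MINI FUNC numbers visible in the Table III excerpt further down (0.4833 for the FVEVAL-3-SHOT baseline vs 0.7833 for FVRuleLearner), not to the paper's cross-dataset averages.

```python
# Illustrative arithmetic only; the inputs are the single Table III row quoted
# on this page, not the paper's averaged headline numbers.

def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a percentage."""
    return (new - baseline) / baseline * 100.0

def failure_reduction(baseline_correct: float, new_correct: float) -> float:
    """Percentage of the baseline's failures eliminated by the new method."""
    base_fail = 1.0 - baseline_correct
    new_fail = 1.0 - new_correct
    return (base_fail - new_fail) / base_fail * 100.0

# FUNC column, O3-MINI on NL2SVA-MACHINE: 0.4833 baseline vs 0.7833.
print(round(relative_gain(0.4833, 0.7833), 2))      # → 62.07 (% relative gain)
print(round(failure_reduction(0.4833, 0.7833), 2))  # → 58.06 (% of failures removed)
```

The two views differ: a 62% relative gain in correctness corresponds to removing 58% of the failures on this row, which is how a 31.17% correctness gain and a 70.33% failure reduction can coexist as averages.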

What carries the argument

The Operator Reasoning Tree (OP-Tree), which decomposes NL-to-SVA alignment into fine-grained operator-aware questions and stores complete reasoning paths that produced correct assertions for later retrieval.
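The extract does not publish the tree's layout or retrieval algorithm; the following is a minimal hypothetical sketch of an operator-keyed store of successful reasoning traces, with every class and method name invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical sketch in the spirit of the OP-Tree described above.
# Node and method names are illustrative, not the paper's.

@dataclass
class OpNode:
    operator: str            # e.g. "|->", "##1", "$past"
    question: str            # the operator-aware question posed at this node
    children: list["OpNode"] = field(default_factory=list)
    successful_paths: list[list[str]] = field(default_factory=list)

class OpTree:
    def __init__(self) -> None:
        self.roots: dict[str, OpNode] = {}

    def add_success(self, operators: list[str], trace: list[str]) -> None:
        """Training: retain a reasoning trace only if it produced a correct SVA."""
        key = operators[0]
        node = self.roots.setdefault(key, OpNode(key, f"How should '{key}' be applied?"))
        node.successful_paths.append(trace)

    def retrieve(self, operators: list[str]) -> list[list[str]]:
        """Testing: fetch stored traces whose root operator matches the new spec."""
        return [t for op in operators if op in self.roots
                for t in self.roots[op].successful_paths]
```

At test time the retrieved traces would seed Op-Rule generation for the unseen specification.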

Load-bearing premise

The learned tree can retrieve reasoning paths that remain correct for entirely new natural-language specifications without creating fresh functional errors.

What would settle it

Run the method on a fresh set of natural-language hardware specifications never seen in training and measure whether the fraction of functional failures drops below 50 percent or syntax correctness falls below the baseline.
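That settling test reduces to two threshold checks over a held-out set, sketched below; the `results` list of (syntax_ok, functional_ok) pairs is synthetic.

```python
# Sketch of the settling criterion above: evaluate on held-out specs,
# then test the two thresholds. `results` is a hypothetical outcome list.

def settles_claim(results, baseline_syntax: float) -> dict:
    n = len(results)
    syntax_rate = sum(s for s, _ in results) / n
    functional_failure = sum(1 - f for _, f in results) / n
    return {
        "functional_failures_below_50pct": functional_failure < 0.50,
        "syntax_at_or_above_baseline": syntax_rate >= baseline_syntax,
    }

# e.g. 40 held-out specs: 38 syntactically valid, 31 also functionally correct
results = [(1, 1)] * 31 + [(1, 0)] * 7 + [(0, 0)] * 2
print(settles_claim(results, baseline_syntax=0.95))
```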

Figures

Figures reproduced from arXiv: 2604.03245 by Chia-Tung Ho, Cunxi Yu, Deming Chen, Haoxing Ren, Lily Jiaxin Wan, Yunsheng Bai.

Figure 1
Figure 1. An illustrative example of these limitations in NL-to-SVA translation. view at source ↗
Figure 2
Figure 2. Operator-level functional mismatch analysis between generated and golden SVAs across three datasets. Temporal operators dominate the errors, highlighting the need for operator-aware reasoning. view at source ↗
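Figure 2's six groups suggest a simple token-to-category mapping. The sketch below follows standard SVA operator usage; the token assignments are a plausible reading of the taxonomy, not the paper's exact one.

```python
# Rough token → category map for the six operator groups named in Figure 2.
# Assignments follow standard SVA usage; this is a guess at the paper's
# taxonomy, not a reproduction of it.

SVA_OPERATOR_GROUPS = {
    "temporal implication": ["|->", "|=>"],
    "temporal delay":       ["##1", "##[1:3]", "##[*]"],
    "temporal sampling":    ["$past", "$rose", "$fell", "$stable", "$changed"],
    "temporal liveness":    ["s_eventually", "s_until", "until", "eventually"],
    "combinational logic":  ["&&", "||", "iff", "implies"],
}

def categorize(sva_text: str) -> set[str]:
    """Operator groups whose tokens appear in an assertion string; anything
    unmatched falls into the 'miscellaneous structural' group."""
    groups = {g for g, ops in SVA_OPERATOR_GROUPS.items()
              if any(op in sva_text for op in ops)}
    return groups or {"miscellaneous structural"}

print(sorted(categorize("req |-> ##1 $past(gnt)")))
# → ['temporal delay', 'temporal implication', 'temporal sampling']
```

Real SVA handling would need tokenization rather than substring tests; this only makes the taxonomy concrete.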
Figure 3
Figure 3. Comparison between General Rules and Op-Rules. view at source ↗
Figure 4
Figure 4. The flow overview of FVRULELEARNER and an illustration of OP-TREE in training. (a) Training phase: the framework leverages OP-TREE to consolidate operator-level reasoning traces, which are stored in T∗. (b) Testing phase: FVRULELEARNER first generates an initial SVA (ŷ_init) and retrieves the relevant OP-TREE from T∗ according to the similarity score between the natural-language specification of the test data and … view at source ↗
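The similarity scoring in the testing phase is not specified in this extract; as a stand-in, a bag-of-words cosine over specification text illustrates the retrieval step (an embedding model is more likely in practice).

```python
import math
from collections import Counter

# Minimal stand-in for the similarity score in Figure 4's testing phase:
# bag-of-words cosine between the test spec and each stored one.
# Hypothetical; the paper's actual scorer is not given in this extract.

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def retrieve_best(test_spec: str, stored_specs: list[str]) -> str:
    """Return the stored specification most similar to the test one."""
    return max(stored_specs, key=lambda s: cosine(test_spec, s))

stored = ["grant must rise one cycle after request",
          "data remains stable while enable is low"]
print(retrieve_best("the grant signal rises one cycle after a request", stored))
# → grant must rise one cycle after request
```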
Figure 5
Figure 5. Figure 5: Comparison of fixing ratio trends over training iterations for FVRULELEARNER (circle) versus General Rule Learning (square) methods. Across (a) NL2SVA-HUMAN, (b) NL2SVA-MACHINE, and (c) NL2SVA-OPENCORE, the proposed OP-TREE rules learning methodology consistently accelerates convergence and achieves higher final accuracy, improving the fixing rate by an average of 23.33%. Since each baseline, as well as ou… view at source ↗
Figure 6
Figure 6. Effect of Training Data Ratio on FVRULELEARNER Performance on the NL2SVA-MACHINE dataset. view at source ↗

Table III. Rule Retrieval Ablation Study Results for O3-MINI on the NL2SVA-MACHINE dataset:

  Method           SYN     FUNC    R.FUNC
  FVEVAL-3-SHOT    0.9500  0.4833  0.5833
  General Rules    0.9500  0.6667  0.7500
  Static Op-Rules  0.9500  0.7167  0.7833
  FVRULELEARNER    0.9667  0.7833  0.8333
read the original abstract

The remarkable reasoning and code generation capabilities of large language models (LLMs) have recently motivated increasing interest in automating formal verification (FV), a process that ensures hardware correctness through mathematically precise assertions but remains highly labor-intensive, particularly through the translation of natural language into SystemVerilog Assertions (NL-to-SVA). However, LLMs still struggle with SVA generation due to limited training data and the intrinsic complexity of FV operators. Consequently, a more efficient and robust methodology for ensuring correct SVA operator selection is essential for producing functionally correct assertions. To address these challenges, we introduce FVRuleLearner, an Operator-Level Rule (Op-Rule) learning framework built on a novel Operator Reasoning Tree (OP-Tree), which models SVA generation as structured, interpretable reasoning. FVRuleLearner operates in two complementary phases: (1) Training: it constructs OP-Tree that decomposes NL-to-SVA alignment into fine-grained, operator-aware questions, combining reasoning paths that lead to correct assertions; and (2) Testing: it performs operator-aligned retrieval to fetch relevant reasoning traces from the learned OP-Tree and generate new rules for unseen specifications. In the comprehensive studies, the proposed FVRuleLearner outperforms the state-of-the-art baseline by 3.95% in syntax correctness and by 31.17% in functional correctness on average. Moreover, FVRuleLearner successfully reduces an average of 70.33% of SVA functional failures across diverse operator categories through a functional taxonomy analysis, showing the effectiveness of applying learned OP-Tree to the Op-Rule generations for unseen NL-to-SVA tasks. These results establish FVRuleLearner as a new paradigm for domain-specific reasoning and rule learning in formal verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FVRuleLearner, a two-phase framework for NL-to-SVA translation in formal verification. In training, it builds an Operator Reasoning Tree (OP-Tree) by decomposing specifications into operator-aware questions and retaining only reasoning paths that produce correct assertions. In testing, it performs operator-aligned retrieval from the learned tree to generate rules for unseen natural-language specifications. The central claims are average improvements of 3.95% in syntax correctness and 31.17% in functional correctness over a state-of-the-art baseline, plus a 70.33% average reduction in SVA functional failures across operator categories.

Significance. If the reported gains are reproducible and the generalization holds, the work would represent a meaningful step toward more reliable automated assertion generation by replacing opaque LLM prompting with an interpretable, operator-structured retrieval mechanism. The OP-Tree construction and functional-taxonomy analysis are concrete contributions that could be extended to other verification domains.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Setup): the headline performance numbers (3.95% syntax, 31.17% functional correctness, 70.33% failure reduction) are presented without dataset cardinality, number of test specifications, baseline implementation details, or any statistical significance tests. These omissions make it impossible to judge whether the observed deltas are reliable or could be explained by sampling variance.
  2. [§3.2 and §5] §3.2 (Testing Phase) and §5 (Results): the claim that operator-aligned retrieval produces functionally correct rules for completely unseen specifications rests on the untested assumption that surface-level operator matching preserves semantic equivalence. No out-of-distribution split by operator novelty, no ablation that removes specific operators from training, and no independent model-checking verification of generated assertions (beyond the reported aggregate metrics) are provided.
  3. [§3.1] §3.1 (OP-Tree Construction): the training procedure selects only reasoning paths that yield correct assertions, but the manuscript supplies no quantitative characterization of how many paths are discarded, how operator coverage in the retained tree compares to the test distribution, or whether the tree can represent operator combinations absent from training.
minor comments (2)
  1. [§3] Notation for the OP-Tree nodes and retrieval scoring function is introduced without a compact mathematical definition or pseudocode, making the exact retrieval algorithm difficult to re-implement.
  2. [§5] Figure captions and axis labels in the functional-taxonomy plots should explicitly state the number of specifications per operator category.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify key areas for strengthening the experimental rigor and clarity of our work. We address each major comment point by point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Setup): the headline performance numbers (3.95% syntax, 31.17% functional correctness, 70.33% failure reduction) are presented without dataset cardinality, number of test specifications, baseline implementation details, or any statistical significance tests. These omissions make it impossible to judge whether the observed deltas are reliable or could be explained by sampling variance.

    Authors: We agree that these details are necessary to assess result reliability. In the revised manuscript we will expand the abstract and §4 to report the exact dataset cardinality (training and test specification counts), full baseline implementation details (including prompting strategies and model versions), and statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) to confirm the improvements exceed sampling variance. revision: yes

  2. Referee: [§3.2 and §5] §3.2 (Testing Phase) and §5 (Results): the claim that operator-aligned retrieval produces functionally correct rules for completely unseen specifications rests on the untested assumption that surface-level operator matching preserves semantic equivalence. No out-of-distribution split by operator novelty, no ablation that removes specific operators from training, and no independent model-checking verification of generated assertions (beyond the reported aggregate metrics) are provided.

    Authors: Operator alignment follows the functional taxonomy in which each operator defines a distinct semantic category; surface matching therefore corresponds to category-level equivalence by construction. To strengthen the claim we will add an OOD split by operator novelty and an ablation that removes specific operators from training. We will also clarify that functional correctness is obtained via independent model checking of each generated assertion against its specification (using standard FV tools), not solely aggregate metrics. revision: partial

  3. Referee: [§3.1] §3.1 (OP-Tree Construction): the training procedure selects only reasoning paths that yield correct assertions, but the manuscript supplies no quantitative characterization of how many paths are discarded, how operator coverage in the retained tree compares to the test distribution, or whether the tree can represent operator combinations absent from training.

    Authors: We will revise §3.1 to include quantitative characterization of OP-Tree construction: the fraction of reasoning paths retained after correctness filtering, operator-coverage statistics comparing the retained tree to the test distribution, and an analysis showing how the hierarchical structure encodes unseen operator combinations via shared sub-paths. revision: yes
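The bootstrap confidence intervals proposed in the first response could take the following paired-bootstrap form; the 0/1 outcome vectors are synthetic illustrations, not the paper's data.

```python
import random

# Paired bootstrap on per-specification correctness, as the rebuttal proposes.
# The outcome vectors below are synthetic; only the procedure is the point.

def bootstrap_delta_ci(baseline, method, iters=5000, alpha=0.05, seed=0):
    """CI for mean(method) - mean(baseline), resampling paired outcomes."""
    rng = random.Random(seed)
    n = len(baseline)
    deltas = []
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(method[i] - baseline[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int(alpha / 2 * iters)]
    hi = deltas[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

baseline = [1] * 29 + [0] * 31   # 29/60 functionally correct (synthetic)
method   = [1] * 47 + [0] * 13   # 47/60 functionally correct (synthetic)
lo, hi = bootstrap_delta_ci(baseline, method)
print(lo > 0)  # the improvement is significant if the whole CI sits above zero
```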

Circularity Check

0 steps flagged

No significant circularity; empirical retrieval on unseen specs is self-contained

full rationale

The paper constructs an OP-Tree during training by decomposing NL-to-SVA into operator-aware paths that produce correct assertions, then uses operator-aligned retrieval at test time to generate rules for unseen specifications. Reported gains (3.95% syntax correctness, 31.17% functional correctness, 70.33% failure reduction) are measured via direct comparison against baselines on held-out data, not by any equation or definition that reduces the output to a fitted input or self-citation. No mathematical derivations, self-definitional loops, or load-bearing prior-author uniqueness theorems appear in the text. The method therefore remains externally falsifiable on new NL specs and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework assumes that decomposing NL-to-SVA tasks into operator-aware questions yields reusable reasoning paths that generalize to unseen specifications. No explicit free parameters or invented physical entities are stated.

axioms (1)
  • domain assumption LLMs benefit from structured operator-aware reasoning paths when generating SVA
    Invoked in the training phase that constructs the OP-Tree from correct assertion examples.
invented entities (1)
  • Operator Reasoning Tree (OP-Tree) no independent evidence
    purpose: Models SVA generation as structured, interpretable reasoning decomposed by operators
    New structure introduced to organize reasoning paths; no external falsifiable evidence provided in abstract.

pith-pipeline@v0.9.0 · 5641 in / 1256 out tokens · 33596 ms · 2026-05-15T14:27:56.401190+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 3 internal anchors

  1. [1] H. Foster, "Part 8: The 2020 Wilson Research Group functional verification study," Siemens EDA, 2021. Accessed: 2024-07-25. [Online]. Available: https://blogs.sw.siemens.com/verificationhorizons/2021/01/06/part-8-the-2020-wilson-research-group-functional-verification-study/
  2. [2] O. Hasan and S. Tahar, "Formal verification methods," in Encyclopedia of Information Science and Technology, Third Edition. IGI Global, 2015, pp. 7162–7170.
  3. [3] E. Seligman, T. Schubert, and M. A. K. Kumar, Formal Verification: An Essential Toolkit for Modern VLSI Design. Elsevier, 2023.
  4. [4] S. Vijayaraghavan and M. Ramanathan, A Practical Guide for SystemVerilog Assertions. Springer Science & Business Media, 2005.
  5. [5] E. Cerny, S. Dudani, J. Havlicek, D. Korchemny et al., SVA: The Power of Assertions in SystemVerilog. Springer, 2015.
  6. [6] C. E. Cummings, "SystemVerilog assertions design tricks and SVA bind files," in SNUG, 2009, pp. 1–42.
  7. [7] H. Foster, K. Larsen, and M. Turpin, "Introduction to the new Accellera Open Verification Library," in DVCon'06: Proceedings of the Design and Verification Conference and Exhibition. Citeseer, 2006.
  8. [8] M. Liu et al., "ChipNeMo: Domain-adapted LLMs for chip design," arXiv:2311.00176, 2023.
  9. [9] R. Kande, H. Pearce, B. Tan, B. Dolan-Gavitt, S. Thakur, R. Karri, and J. Rajendran, "LLM-assisted generation of hardware assertions," arXiv preprint arXiv:2306.14027, 2023.
  10. [10] M. Orenes-Vera, M. Martonosi, and D. Wentzlaff, "Using LLMs to facilitate formal verification of RTL," arXiv e-prints, arXiv–2309, 2023.
  11. [11] D. Nam, A. Macvean, V. Hellendoorn, B. Vasilescu, and B. Myers, "Using an LLM to help with code understanding," in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, 2024, pp. 1–13.
  12. [12] F. Lin, D. J. Kim et al., "When LLM-based code generation meets the software development process," arXiv preprint arXiv:2403.15852, 2024.
  13. [13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
  14. [14] Z. Wang, Z. Zhou, D. Song, Y. Huang, S. Chen, L. Ma, and T. Zhang, "Where do large language models fail when generating code?" arXiv preprint arXiv:2406.08731, 2024.
  15. [15] W. Fang, M. Li, M. Li, Z. Yan, S. Liu, H. Zhang, and Z. Xie, "AssertLLM: Generating and evaluating hardware verification assertions from design specifications via multi-LLMs," arXiv preprint arXiv:2402.00386, 2024.
  16. [16] OpenCores, "OpenCores: Open source hardware designs," 2024. Accessed: June 28, 2024. [Online]. Available: https://opencores.org/
  17. [17] N. Bjørner, A. Browne, and Z. Manna, "Automatic generation of invariants and intermediate assertions," Theoretical Computer Science, vol. 173, no. 1, pp. 49–87, 1997.
  18. [18] A. Hekmatpour and A. Salehi, "Block-based schema-driven assertion generation for functional verification," in 14th Asian Test Symposium (ATS'05). IEEE, 2005, pp. 34–39.
  19. [19] D. Yu, H. Foster, and T. Fitzpatrick, "A survey of machine learning applications in functional verification," DVCon US, 2023.
  20. [20] M. J. Kang and H. Ren, "FVEval: Evaluating LLMs on hardware formal verification," https://github.com/NVlabs/FVEval, 2024.
  21. [21] V. Pulavarthi, D. Nandal, S. Dan, and D. Pal, "AssertionBench: A benchmark to evaluate large-language models for assertion generation," arXiv preprint arXiv:2406.18627, 2024.
  22. [22] C. Sun, C. Hahn, and C. Trippel, "Towards improving verification productivity with circuit-aware translation of natural language to SystemVerilog assertions," in First International Workshop on Deep Learning-aided Verification, 2023.
  23. [23] B. Mali, K. Maddala, S. Reddy, V. Gupta, C. Karfa, and R. Karri, "ChIRAAG: ChatGPT informed rapid and automated assertion generation," arXiv preprint arXiv:2402.00093, 2024.
  24. [24] M. Hassan, S. Ahmadi-Pour, K. Qayyum, C. K. Jha, and R. Drechsler, "LLM-guided formal verification coupled with mutation testing," in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024, pp. 1–2.
  25. [25] K. Maddala, B. Mali, and C. Karfa, "LAAG-RV: LLM assisted assertion generation for RTL design verification," in 2024 IEEE 8th International Test Conference India (ITC India). IEEE, 2024, pp. 1–6.
  26. [27] [Online]. Available: https://arxiv.org/abs/2506.21569
  27. [28] J. Cen, J. Liu, Z. Li, and J. Wang, "SQLFixAgent: Towards semantic-accurate SQL generation via multi-agent collaboration," arXiv preprint arXiv:2406.13408, 2024.
  28. [29] M. Endres, S. Fakhoury, S. Chakraborty, and S. K. Lahiri, "Can large language models transform natural language intent into formal method postconditions?" Proceedings of the ACM on Software Engineering, vol. 1, no. FSE, pp. 1889–1912, 2024.
  29. [30] F. He, J. Zhai, and M. Pan, "Beyond code generation: Assessing code LLM maturity with postconditions," arXiv preprint arXiv:2407.14118, 2024.
  30. [31] N. Mündler, M. N. Müller, J. He, and M. Vechev, "Code agents are state of the art software testers," arXiv preprint arXiv:2406.12952, 2024.
  31. [32] J. Song, J. Raiman, and B. Catanzaro, "Effective large language model debugging with best-first tree search," arXiv preprint arXiv:2407.19055, 2024.
  32. [33] X. Yin, C. Ni, S. Wang, Z. Li, L. Zeng, and X. Yang, "ThinkRepair: Self-directed automated program repair," arXiv preprint arXiv:2407.20898, 2024.
  33. [34] C.-T. Ho and H. Ren, "Large language model (LLM) for standard cell layout design optimization," arXiv preprint arXiv:2406.06549, 2024.
  34. [35] Anthropic PBC, "Introducing Claude Sonnet 4.5," https://www.anthropic.com/news/claude-sonnet-4-5, Sep. 2025. Accessed: 2025-11-06.
  35. [37]
  36. [38] Y. Tsai et al., "RTLFixer: Automatically fixing RTL syntax errors with large language models," arXiv:2311.16543, 2023.
  37. [39] C.-T. Ho et al., "VerilogCoder: Autonomous Verilog coding agents with graph-based planning and AST-based waveform tracing tool," in AAAI.
  38. [40] [Online]. Available: https://doi.org/10.1609/aaai.v39i1.32007
  39. [41] J. Wan, C.-T. Ho, R. Liang, C. Yu, D. Chen, and H. Ren, "SchemaCoder: Automatic log schema extraction coder with residual Q-tree boosting," arXiv preprint arXiv:2508.18554, 2025. [Online]. Available: https://arxiv.org/abs/2508.18554
  40. [42] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," in MLCAD, 2023.
  41. [43] X. Liang, S. Song, Z. Zheng, H. Wang, Q. Yu, X. Li, R.-H. Li, F. Xiong, and Z. Li, "Internal consistency and self-feedback in large language models: A survey," arXiv preprint arXiv:2407.14507, 2024.
  42. [44] L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang, "Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies," arXiv preprint arXiv:2308.03188, 2023.
  43. [45] R. Kamoi, Y. Zhang, N. Zhang, J. Han, and R. Zhang, "When can LLMs actually correct their own mistakes? A critical survey of self-correction of LLMs," arXiv preprint arXiv:2406.01297, 2024.
  44. [46] N. Jiang, X. Li, S. Wang, Q. Zhou, S. B. Hossain, B. Ray, V. Kumar, X. Ma, and A. Deoras, "Training LLMs to better self-debug and explain code," arXiv preprint arXiv:2405.18649, 2024.
  45. [47] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., "Self-Refine: Iterative refinement with self-feedback," Advances in Neural Information Processing Systems, vol. 36, 2023.
  46. [48] H. Huang, Z. Lin, Z. Wang, X. Chen, K. Ding, and J. Zhao, "Towards LLM-powered Verilog RTL assistant: Self-verification and self-correction," arXiv preprint arXiv:2406.00115, 2024.
  47. [49] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language agents with verbal reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
  48. [50] Y.-C. Liao, C.-J. Yu, C.-Y. Lin, H.-F. Yun, Y.-H. Wang, H.-M. Li, and Y.-C. Fan, "Learning-from-mistakes prompting for indigenous language translation," arXiv preprint arXiv:2407.13343, 2024.
  49. [51] Y. Qu, T. Zhang, N. Garg, and A. Kumar, "Recursive introspection: Teaching language model agents how to self-improve," arXiv preprint arXiv:2407.18219, 2024.
  50. [52] Y. Tong, D. Li, S. Wang, Y. Wang, F. Teng, and J. Shang, "Can LLMs learn from previous mistakes? Investigating LLMs' errors to boost for reasoning," arXiv preprint arXiv:2403.20046, 2024.
  51. [53] T. Zhang, A. Madaan, L. Gao, S. Zheng, S. Mishra, Y. Yang, N. Tandon, and U. Alon, "In-context principle learning from mistakes," arXiv preprint arXiv:2402.05403, 2024.
  52. [54] H. Sun, Y. Jiang, B. Wang, Y. Hou, Y. Zhang, P. Xie, and F. Huang, "Retrieved in-context principles from previous mistakes," arXiv preprint arXiv:2407.05682, 2024.
  53. [55] X. Liang, S. Niu, S. Zhang, S. Song, H. Wang, J. Yang, F. Xiong, B. Tang, C. Xi et al., "Empowering large language models to set up a knowledge retrieval indexer via self-learning," arXiv preprint arXiv:2405.16933, 2024.
  54. [56] X. Yang, H. Chen, W. Feng, H. Wang, Z. Ye, X. Shen, X. Yang, S. Sun, W. Liu, and J. Bian, "Collaborative evolving strategy for automatic data-centric development," arXiv preprint arXiv:2407.18690, 2024.
  55. [57] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, "Large language models as optimizers," 2024.
  56. [58] S. Zhang, J. Zhang, J. Liu, L. Song, C. Wang, R. Krishna, and Q. Wu, "Offline training of language model agents with functions as learnable weights," in Forty-first International Conference on Machine Learning, 2024.
  57. [59] W. Zhou, Y. Ou, S. Ding, L. Li, J. Wu, T. Wang, J. Chen, S. Wang, X. Xu, N. Zhang et al., "Symbolic learning enables self-evolving agents," arXiv preprint arXiv:2406.18532, 2024.
  58. [60] M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, Z. Huang, C. Guestrin, and J. Zou, "TextGrad: Automatic 'differentiation' via text," arXiv preprint arXiv:2406.07496, 2024.
  59. [61] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena," Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
  60. [62] Cadence Design Systems, "Cadence JasperGold formal verification platform," https://www.cadence.com/en_US/home/tools/system-design-and-verification/formal-and-static-verification/jasper-gold-verification-platform.html, 2024. Accessed: 2024-07-31.
  61. [63] J. Niimi, "A simple ensemble strategy for LLM inference: Towards more stable text classification," in International Conference on Applications of Natural Language to Information Systems. Springer, 2025, pp. 189–199.
  62. [64] OpenAI et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.
  63. [65] OpenAI, "OpenAI o3-mini," https://openai.com/index/openai-o3-mini/, Jan. 2025. Accessed: 2025-11-06.