pith. machine review for the scientific record.

arxiv: 2603.25111 · v2 · submitted 2026-03-26 · 💻 cs.LG · cs.PL · cs.SE

Recognition: 2 Lean theorem links

SEVerA: Verified Synthesis of Self-Evolving Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.PL · cs.SE
keywords self-evolving agents · formal verification · constrained learning · LLM agents · first-order logic contracts · program synthesis · safety guarantees

The pith

SEVerA embeds first-order logic contracts into self-evolving agents to guarantee zero violations while raising task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agent synthesis can be treated as constrained learning where hard formal specifications are paired with soft task objectives. It introduces Formally Guarded Generative Models that let a planner LLM write first-order logic contracts around each model call, enforced by rejection sampling plus a verified fallback. The resulting three-stage pipeline searches for candidate programs, proves the contracts hold for every parameter value, and then optimizes the remaining objective with gradient methods. Experiments on program verification, symbolic mathematics, and compliant tool use show the method eliminates all violations while exceeding both unconstrained agents and prior state-of-the-art baselines. A reader should care because autonomous agents that run on unseen inputs require enforceable safety without sacrificing capability.
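The constrained-learning framing can be written out in notation of our own (the symbols θ, Ψ, and U below are illustrative, not lifted from the paper):

```latex
% Soft task utility U is maximized over the parameters \theta of the
% synthesized program P_\theta, subject to the hard first-order contract \Psi
% holding for every input x and every parameter value:
\max_{\theta}\; \mathbb{E}_{x}\!\left[ U\big(P_{\theta}(x)\big) \right]
\quad \text{s.t.} \quad
\forall x\;\forall \theta':\; \Psi\big(x,\, P_{\theta'}(x)\big).
```

Because verification establishes the constraint for all parameter values at once, the feasible set is the entire parameter space, and the learning stage becomes ordinary unconstrained optimization of the soft objective.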

Core claim

SEVerA formulates agentic code generation as constrained learning and introduces Formally Guarded Generative Models that wrap each generative call in a rejection sampler backed by a verified fallback, ensuring every output satisfies a first-order logic contract for any input and parameter setting. The three-stage framework searches for candidate parametric programs containing such guarded calls, verifies correctness with respect to the hard constraints for all parameter values, and finally applies scalable gradient-based optimization to improve the soft objective while the verified contracts remain intact.
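The three stages can be sketched as a toy loop. This is a hedged illustration, not the paper's implementation: `search`, `verify`, and `learn` here are stand-ins, and the "proof for all parameter values" is faked with a spot check that a real verifier would replace with an actual proof.

```python
# Toy sketch of SEVerA's search -> verify -> learn pipeline (names are ours).

def search(task):
    """Stage 1: the planner LLM proposes candidate parametric programs.
    Toy stand-in: a fixed family of programs in parameter w."""
    return [lambda x, w: w * x, lambda x, w: w * x + 1.0]

def verify(program):
    """Stage 2: prove the hard contract for ALL parameter values w.
    Toy contract: program(0, w) == 0 for every w. A real verifier would
    discharge this as a proof; here we merely spot-check representative w."""
    return all(program(0.0, w) == 0.0 for w in (-1e6, 0.0, 1e6))

def learn(program, data):
    """Stage 3: with the contract proven for every w, tuning w for the soft
    objective is ordinary unconstrained optimization (toy: grid search)."""
    return min((-2.0, -1.0, 0.0, 1.0, 2.0),
               key=lambda w: sum((program(x, w) - y) ** 2 for x, y in data))

data = [(1.0, 2.0), (2.0, 4.0)]
verified = [p for p in search("fit-linear") if verify(p)]  # w*x survives; w*x+1 is rejected
best_w = learn(verified[0], data)  # tuning cannot break the verified contract
```

The point of the toy is the ordering: the candidate violating the hard contract is discarded before learning, so gradient-style tuning afterwards only ever moves within programs whose contracts already hold.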

What carries the argument

Formally Guarded Generative Models (FGGM), which attach a first-order logic output contract to each generative model call and enforce it through rejection sampling plus a verified fallback.
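A guarded call of this shape can be sketched in a few lines. This is a minimal illustration under our own assumptions (the names `guarded_call`, `contract`, `fallback`, and `max_attempts` are ours, and a toy random sampler stands in for the LLM); the bounded-output contract mirrors the boundedParam example shown in Figure 2.

```python
import random

def guarded_call(generate, contract, fallback, max_attempts=8):
    """Rejection-sample until the contract holds; otherwise use the fallback.

    If the fallback is verified to satisfy the contract independently of the
    generator, every value returned here satisfies the contract, regardless of
    how the underlying model behaves."""
    for _ in range(max_attempts):
        candidate = generate()
        if contract(candidate):   # first-order contract checked on the output
            return candidate
    return fallback()             # verified fallback guarantees the contract

# Toy contract from symbolic regression: output bounded in [l, u].
l, u = 0.0, 1.0
contract = lambda y: l <= y <= u
out = guarded_call(lambda: random.gauss(0.5, 2.0),  # stand-in generator
                   contract,
                   fallback=lambda: l)              # trivially satisfies l <= l <= u
```

Whatever the sampler emits, `out` lies in [0, 1]: either a sampled value passed the check, or the fallback supplied one that does by construction.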

If this is right

  • Zero constraint violations on Dafny verification, symbolic math synthesis, and policy-compliant tool use.
  • Performance gains over both unconstrained self-evolving agents and current state-of-the-art baselines.
  • Formal contracts steer the search toward higher-quality agents rather than merely restricting them.
  • After verification the remaining problem reduces to ordinary unconstrained optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guarding technique could be applied to other generative systems that produce executable code or plans.
  • If verification scales, the approach might support longer-horizon agent behaviors that currently lack safety assurances.
  • Success hinges on whether first-order logic remains expressive enough for the contracts that real deployments require.

Load-bearing premise

A planner LLM can write first-order logic contracts that accurately capture desired behavior and that verification can prove hold for every possible parameter value without prohibitive cost.

What would settle it

A synthesized agent that violates one of its declared contracts on at least one input, or a verification step that fails to certify correctness for all parameter values within feasible compute limits.

Figures

Figures reproduced from arXiv: 2603.25111 by Changming Xu, Chu-Cheng Lin, Daiyi Peng, Debangshu Banerjee, Eugene Ie, Gagandeep Singh, Ming Zhang.

Figure 1. Overview of SEVerA, which operates in three key steps.
Figure 2. Auto-synthesized guarded GM with rejection sampler (1) and fallback (2).
Figure 3. Two example candidate parametric programs generated for symbolic regression.
Figure 4. FGGM definition for initialFGGM. The prompting function builds context-specific guidance by checking for lemmas, generics, sets, and multisets in the base program. The fallback returns the original program, satisfying Ψ_l by the reflexivity axiom.
Figure 5. FGGM definition for diffErrorFGGM. Invoked when the previous iteration's output failed the diff checker. The prompt includes the diff error and previous attempt, instructing the model to preserve the base program logic.
Figure 6. FGGM definition for verifierErrorFGGM. Invoked when the previous iteration passed the diff checker but failed verification. The prompting function analyzes the verification error to provide targeted guidance (timeout, lemma, invariant, or type issues).
Figure 7. Verified agent program for Dafny annotation synthesis.
Figure 8. Planner LLM prompt template. The planner receives this prompt at each search–verify–learn iteration.
read the original abstract

Recent advances have shown the effectiveness of self-evolving LLM agents on tasks such as program repair and scientific discovery. In this paradigm, a planner LLM synthesizes an agent program that invokes parametric models, including LLMs, which are then tuned per task to improve performance. However, existing self-evolving agent frameworks provide no formal guarantees of safety or correctness. Because such programs are often executed autonomously on unseen inputs, this lack of guarantees raises reliability and security concerns. We formulate agentic code generation as a constrained learning problem, combining hard formal specifications with soft objectives capturing task utility. We introduce Formally Guarded Generative Models (FGGM), which allow the planner LLM to specify a formal output contract for each generative model call using first-order logic. Each FGGM call wraps the underlying model in a rejection sampler with a verified fallback, ensuring every returned output satisfies the contract for any input and parameter setting. Building on FGGM, we present SEVerA (Self-Evolving Verified Agents), a three-stage framework: Search synthesizes candidate parametric programs containing FGGM calls; Verification proves correctness with respect to hard constraints for all parameter values, reducing the problem to unconstrained learning; and Learning applies scalable gradient-based optimization, including GRPO-style fine-tuning, to improve the soft objective while preserving correctness. We evaluate SEVerA on Dafny program verification, symbolic math synthesis, and policy-compliant agentic tool use ($\tau^2$-bench). Across tasks, SEVerA achieves zero constraint violations while improving performance over unconstrained and SOTA baselines, showing that formal behavioral constraints not only guarantee correctness but also steer synthesis toward higher-quality agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SEVerA, a three-stage framework (Search, Verification, Learning) for synthesizing self-evolving LLM agents with formal guarantees. It defines Formally Guarded Generative Models (FGGM) that wrap parametric model calls (including LLMs) in first-order logic contracts enforced by rejection sampling plus verified fallbacks, claiming this ensures zero constraint violations for any input and any parameter values. Verification reduces the problem to unconstrained learning, after which gradient-based optimization (including GRPO-style fine-tuning) improves task utility. Experiments on Dafny program verification, symbolic math synthesis, and policy-compliant tool use report zero violations together with gains over unconstrained and SOTA baselines.

Significance. If the zero-violation guarantee can be shown to hold at practical cost without restricting expressiveness, the work would provide a concrete route to formally safe self-evolving agents, combining hard constraints with scalable learning in a way that could influence both verification and agent-synthesis research.

major comments (3)
  1. [Abstract / FGGM definition] Abstract and FGGM construction: the guarantee that every output satisfies the FOL contract for arbitrary inputs and parameter settings rests on the rejection sampler terminating or the fallback being proven correct independently of the model; no termination argument, complexity bound, or fallback proof for LLM generators is supplied.
  2. [Verification stage] Verification stage: the claim that verification reduces the synthesis problem to unconstrained learning is central, yet the manuscript supplies no concrete argument, termination condition, or cost analysis showing the reduction succeeds without excessive overhead or loss of expressiveness for the three evaluated domains.
  3. [Evaluation section] Evaluation: performance gains are reported over unconstrained and SOTA baselines, but exact baseline implementations, fallback invocation frequencies, and verification scalability metrics (e.g., time or success rate per task) are not detailed, leaving the practical magnitude of the claimed improvements difficult to assess.
minor comments (2)
  1. [FGGM / SEVerA framework] Clarify notation for the FOL contracts and the precise interface between the planner LLM and the FGGM wrapper.
  2. [Evaluation] Add standard deviations or statistical tests for the reported performance improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the formal guarantees, verification reduction, and evaluation details. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / FGGM definition] Abstract and FGGM construction: the guarantee that every output satisfies the FOL contract for arbitrary inputs and parameter settings rests on the rejection sampler terminating or the fallback being proven correct independently of the model; no termination argument, complexity bound, or fallback proof for LLM generators is supplied.

    Authors: We agree the manuscript does not supply an explicit termination argument or complexity bound for the rejection sampler with LLM generators. The FGGM design relies on a bounded number of rejection attempts (with timeout triggering the verified fallback) and assumes the fallback is proven correct independently of the generator. We will add a new subsection under FGGM definition providing the termination condition (fixed attempt limit plus fallback), a brief complexity discussion (linear in attempts, with practical termination rates observed), and note that the overall zero-violation guarantee holds because the fallback is verified. This addresses the gap without altering the core claim. revision: yes

  2. Referee: [Verification stage] Verification stage: the claim that verification reduces the synthesis problem to unconstrained learning is central, yet the manuscript supplies no concrete argument, termination condition, or cost analysis showing the reduction succeeds without excessive overhead or loss of expressiveness for the three evaluated domains.

    Authors: The verification stage proves that each FGGM-wrapped program satisfies its FOL contracts for all parameter values using sound verifiers (e.g., Dafny for the first domain), which decouples hard constraints from soft-objective optimization. We will expand the Verification section with a concrete argument (inductive proof over FGGM calls), termination condition (verifier completeness for the supported FOL fragment), and cost analysis (reporting average verification times and success rates per domain, showing overhead remains practical without restricting expressiveness). revision: yes

  3. Referee: [Evaluation section] Evaluation: performance gains are reported over unconstrained and SOTA baselines, but exact baseline implementations, fallback invocation frequencies, and verification scalability metrics (e.g., time or success rate per task) are not detailed, leaving the practical magnitude of the claimed improvements difficult to assess.

    Authors: We acknowledge the need for greater detail on baselines and metrics. We will add a dedicated evaluation subsection specifying exact baseline implementations (unconstrained agents omit FGGM guards entirely; SOTA baselines follow published code where available), report fallback invocation frequencies (observed to be low across runs), and include scalability metrics such as per-task verification time, success rate, and wall-clock overhead. These additions will allow readers to better assess the magnitude of improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; verification reduces to unconstrained learning via external formal methods without tautological reduction of performance claims.

full rationale

The paper's derivation chain is self-contained: FGGM wraps models in rejection samplers plus verified fallbacks to enforce FOL contracts for all inputs/parameters, then Verification stage uses external tools (e.g., Dafny) to prove correctness and reduce to unconstrained optimization for the soft objective. No equations or self-citations make claimed zero-violation + performance gains equivalent to inputs by construction; the separation of hard constraints from learning is explicit and externally verifiable. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The framework rests on the assumption that first-order logic contracts can be specified and verified for generative model outputs, plus standard optimization assumptions; new entities FGGM and SEVerA are introduced without independent evidence beyond the paper's claims.

free parameters (1)
  • task-specific model parameters
    Parameters of invoked LLMs and other models are tuned during the Learning stage to optimize soft objectives.
axioms (2)
  • domain assumption First-order logic suffices to express output contracts for the generative models used
    Invoked when defining FGGM calls in the Search stage.
  • domain assumption Verification can establish correctness for all parameter values after the Search stage
    Central premise enabling reduction to unconstrained learning.
invented entities (2)
  • Formally Guarded Generative Model (FGGM) no independent evidence
    purpose: Wraps generative models with rejection sampling and verified fallback to enforce FOL contracts
    Newly introduced construct to add formal guarantees to LLM calls.
  • SEVerA three-stage framework no independent evidence
    purpose: Synthesizes verified self-evolving agents via Search, Verification, and Learning
    The overall pipeline is proposed in this work.

pith-pipeline@v0.9.0 · 5615 in / 1611 out tokens · 43343 ms · 2026-05-15T00:31:00.879421+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents

    cs.AI · 2026-04 · conditional novelty 6.0

    The Experience Compression Spectrum unifies memory, skills, and rules in LLM agents along increasing compression levels and identifies the absence of adaptive cross-level compression as the missing diagonal.

Reference graph

Works this paper leans on

55 extracted references · 34 canonical work pages · cited by 1 Pith paper · 7 internal anchors
