pith. sign in

arxiv: 2605.17475 · v1 · pith:JERJI66Unew · submitted 2026-05-17 · 💻 cs.SE

Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair

Pith reviewed 2026-05-19 22:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords Event-BLLM agentformal model synthesismodel repairformal methodsverification feedbackautoformalizationcorrect-by-construction
0
0 comments X

The pith

An LLM agent builds correct Event-B models from natural language by iteratively refining and repairing them with verification feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Event-B Agent as a framework that starts with natural language requirements and produces a formal Event-B model through repeated cycles of construction, refinement, and repair. Formal verification supplies error messages that the LLM uses to fix flaws and simplify proofs at each step. Refinement and repair support each other so that models become easier to verify while staying sound. A reader would care because this process lowers the expertise barrier that currently limits formal methods to specialists. The reported evaluations indicate the agent completes synthesis and repair tasks more effectively than prior methods across models of different sizes.

Core claim

Event-B Agent constructs an initial model from natural language requirements and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency.

What carries the argument

The interleaved loop of model construction, refinement, and repair in which the LLM translates verification error messages into updates that keep the model correct.

If this is right

  • Formal models can be developed with less manual mathematical intervention while still guaranteeing correctness through verification.
  • Refinement steps become more tractable because repair maintains soundness at every iteration.
  • The combined process supports end-to-end automation from requirements to verified model across systems of different scales.
  • Efficiency remains practical because the agent avoids exhaustive search by focusing on targeted fixes from feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feedback-driven loop could be tested on other formalisms such as TLA+ to check whether the repair-refinement interaction transfers.
  • Adding optional human review only at refinement milestones might increase reliability for safety-critical applications without losing most automation gains.
  • Once a verified Event-B model exists, downstream code generation tools could be applied directly to produce executable implementations.

Load-bearing premise

The LLM can reliably translate formal verification error messages into sound repairs and refinements that preserve model correctness without introducing new inconsistencies or requiring human correction at each step.

What would settle it

Apply the agent to a set of requirements whose verification produces errors that demand non-local structural changes; failure to reach a verified model without external intervention would show the translation step does not hold.

Figures

Figures reproduced from arXiv: 2605.17475 by Hongshu Wang, Jin Song Dong, Qin Li, Xinyue Zuo, Yamine Ait Ameur, Yuhan Sun.

Figure 1
Figure 1. Figure 1: Simplified Event-B models for the motivating example generated by GPT-5, Cursor, and PAT-Agent. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The abstract and concrete formal models constructed by Event-B Agent for the motivating example. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The grammar of Event-B, where 𝐹𝑉 (𝑝𝑟𝑒𝑑) is the set of free variables in 𝑝𝑟𝑒𝑑. We position our work relative to two lines of research. LLM proof assistants assume a fixed model 𝑀 and search for valid 𝜋, but cannot construct 𝑀0 ⇝ 𝑀1 from natural language requirements or revise 𝑀𝑡 during development. In contrast, LLM-based autoformalization methods iteratively construct models 𝑀0 ⇝ · · · ⇝ 𝑀𝑘 without proof-ba… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of Event-B Agent. From natural-language requirements ( [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Workflow of the model & proof repair component. The model is [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Evolution of the three metrics across refinement steps [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

Building software that is correct by construction is a long-standing goal in software engineering, as it ensures reliability during design and development rather than after deployment. Formal methods realize this vision by enabling the expression of system behavior and requirements in mathematics, thereby guaranteeing correctness through formal verification, including theorem proving and model checking. However, the steep learning curve and demand for mathematical expertise hinder the widespread adoption of formal methods. Large language models (LLMs) have recently shown promise in bridging this gap through autoformalization. However, existing LLM-based approaches are largely limited to isolated tasks, such as theorem proving without formalization or model synthesis with insufficient verification. While valuable, these efforts do not fully exploit the potential of a more comprehensive framework in which models and proofs evolve together, a process that closely reflects real-world development practice. To address this gap, we propose Event-B Agent, a novel framework inspired by the interleaved nature of software design. Given natural language requirements, Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency. These results suggest that Event-B Agent is a promising step toward correct-by-construction formal model synthesis and repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Event-B Agent, an LLM-based agent framework for synthesizing initial Event-B models from natural language requirements and iteratively refining and repairing them using feedback from formal verification. Refinement simplifies proof discharge while repair ensures soundness at each step; the two are claimed to reinforce each other. Evaluation on systems of varying complexity is reported to show substantial outperformance over baselines in end-to-end synthesis and repair while maintaining reasonable efficiency.

Significance. If the results hold, the work offers a practical step toward lowering the barrier to formal methods by automating the interleaved synthesis-repair-refinement cycle that mirrors real development practice. The explicit use of prover feedback to drive LLM actions distinguishes it from prior isolated autoformalization efforts and could support more accessible correct-by-construction modeling.

major comments (2)
  1. [§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.
  2. [§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.
minor comments (2)
  1. [Abstract] Abstract: 'reasonable efficiency' is stated without any quantitative bound (e.g., average iterations or wall-clock time); a short clarification would help readers interpret the efficiency claim.
  2. [§3 (Framework)] Notation: the distinction between 'refinement' and 'repair' steps is introduced early but the precise interface between the two (e.g., which artifacts are passed to the LLM) is not summarized in a single diagram or pseudocode listing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate where revisions will be made to improve the manuscript.

read point-by-point responses
  1. Referee: [§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.

    Authors: We appreciate this observation and agree that additional quantitative detail would strengthen the evaluation. In the revised manuscript we will expand §5 to report concrete success rates for Event-B Agent and all baselines, provide explicit definitions and implementation details for each baseline, state the number of trials run per system, and include an error analysis categorizing failure modes. These additions will allow readers to assess robustness directly. revision: yes

  2. Referee: [§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.

    Authors: This is a fair point. The framework encodes natural-language requirements into the initial Event-B model (invariants, events, and proof obligations) and uses verification feedback to drive repairs that preserve those invariants. However, we acknowledge that the manuscript does not explicitly describe an additional safeguard to re-validate alignment with the original requirements after each LLM-generated repair. In the revision we will add a dedicated paragraph in §4.3 clarifying this assumption, discussing its implications, and outlining a lightweight post-repair validation step that could be incorporated in future extensions. revision: partial

Circularity Check

0 steps flagged

No significant circularity in framework proposal or evaluation

full rationale

The paper introduces Event-B Agent as a novel LLM-driven iterative framework for synthesizing and repairing Event-B models from natural-language requirements, using verification feedback for refinement and repair. The central claims rest on an empirical evaluation across systems of varying complexity that shows outperformance over baselines, with no equations, fitted parameters, uniqueness theorems, or self-citations invoked as load-bearing premises. The derivation chain is self-contained: the framework is defined by construction, and the reported performance is presented as an independent empirical result rather than a statistical or definitional consequence of prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unproven assumption that current LLMs can act as reliable repair agents when given formal feedback; no free parameters or new physical entities are introduced.

axioms (1)
  • domain assumption LLMs can be prompted to generate and iteratively repair formal Event-B models from natural language and verification feedback
    This capability is presupposed by the agent design but is not derived or proven in the abstract.
invented entities (1)
  • Event-B Agent no independent evidence
    purpose: Iterative synthesis and repair of Event-B models
    New named framework introduced by the paper; no independent falsifiable evidence supplied.

pith-pipeline@v0.9.0 · 5809 in / 1228 out tokens · 32841 ms · 2026-05-19T22:48:38.951116+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

  1. [1]

    Jean-Raymond Abrial. 1996. The B-book: Assigning Programs to Meaning.Cambridge University Press(1996). doi:10.1017/CBO9780511624162

  2. [2]

    2010.Modeling in Event-B: system and software engineering

    Jean-Raymond Abrial. 2010.Modeling in Event-B: system and software engineering. Cambridge University Press. doi:10.1017/CBO9781139195881

  3. [3]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774

  4. [4]

    Eman Alkhammash, Michael Butler, Asieh Salehi Fathabadi, and Corina Cîrstea. 2015. Building traceable Event-B models from requirements.Science of Computer Programming111 (2015), 318–338. doi:10.1016/j.scico.2015.06.002

  5. [5]

    Anysphere, Inc. 2025. Cursor: The AI Code Editor. https://cursor.com/ Accessed: 2025-09-11

  6. [6]

    Clark Barrett, Christopher L Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. cvc4. InInternational Conference on Computer Aided Verification. Springer, 171–177. doi:10. 1007/978-3-642-22110-1_14

  7. [7]

    Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277. doi:10.1007/978-3- 030-53518-6_17

  8. [8]

    Cheng-Hao Cai, Jing Sun, and Gillian Dobbie. 2019. Automatic B-model repair using model checking and machine learning.Automated Software Engineering26, 3 (2019), 653–704. doi:10.1007/s10515-019-00264-4

  9. [9]

    Cheng-Hao Cai, Jing Sun, Gillian Dobbie, Zhé Hóu, Hadrien Bride, Jin Song Dong, and Scott Uk-Jin Lee. 2022. Fast automated abstract machine repair using simultaneous modifications and refactoring.Formal Aspects of Computing34, 2 (2022), 1–31. doi:10.1145/3536430

  10. [10]

    Alfredo Capozucca, Daniil Yampolskyi, Alexander Goldberg, and Maximiliano Cristiá. 2025. Do ai assistants help students write formal specifications? a study with chatgpt and the b-method. In2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 19–29. doi:10.1109/CSEET66350.2025.00009

  11. [11]

    Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: proof navigation in Python in the era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641. doi:10.1145/3663529.3663814

  12. [12]

    Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, and Yisong Yue. 2021. Neurosymbolic programming.Foundations and Trends in Programming Languages7, 3 (2021), 158–243. doi:10.1561/ 2500000049

  13. [13]

    Mark Chevallier, Filip Smola, Richard Schmoetten, and Jacques D Fleuriot. 2025. Formally Verified Neurosymbolic Trajectory Learning via Tensor-based Linear Temporal Logic on Finite Traces.arXiv preprint arXiv:2501.13712(2025). doi:10.48550/arXiv.2501.13712

  14. [14]

    Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively translating unstructured natural language to temporal logics with large language models. InInternational Conference on Computer Aided Verification. Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

  15. [15]

    Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340. doi:10.1007/978-3-540-78800-3_24

  16. [16]

    Guillaume Dupont, Yamine Ait-Ameur, Neeraj Kumar Singh, and Marc Pantel. 2021. Event-B hybridation: A proof and refinement-based framework for modelling hybrid systems.ACM Transactions on Embedded Computing Systems (TECS)20, 4 (2021), 1–37. doi:10.1145/3448270

  17. [17]

    Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. doi:10.1145/3611643.3616243

  18. [18]

    Francesco Fuggitti and Tathagata Chakraborti. 2023. NL2LTL–a python package for converting natural language (NL) instructions to linear temporal logic (LTL) formulas. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 16428–16430. doi:10.1609/aaai.v37i13.27068

  19. [20]

    Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish. 2021. TacticToe: learning to prove with tactics.Journal of Automated Reasoning65, 2 (2021), 257–286. doi:10.1007/s10817-020-09580-x

  20. [21]

    Chuqin Geng, Zhaoyue Wang, Haolin Ye, and Xujie Si. 2024. Learning Minimal Neural Specifications.arXiv preprint arXiv:2404.04662(2024). doi:10.48550/arXiv.2404.04662

  21. [22]

    Fabian Gloeckle, Baptiste Roziere, Amaury Hayat, and Gabriel Synnaeve. 2023. Temperature-scaled large language models for Lean proofstep prediction. InThe 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23. https: Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE211. Publication date: July 2026. FSE211:22 Hongshu Wang, Xinyue Zuo, Yuhan Sun, Qi...

  22. [23]

    Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. ProB Animator and Model Checker. https://prob.hhu.de/ Accessed: 2025-09-11

  23. [24]

    Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. Rodin User’s Handbook v.2.8. https://stups.hhu-hosting.de/handbook/rodin/current/html/introduction.html Accessed: 2025-09-11

  24. [25]

    Yang Hong, Shan Jiang, Yulei Fu, and Sarfraz Khurshid. 2025. On the Effectiveness of Large Language Models in Writing Alloy Formulas.arXiv preprint arXiv:2502.15441(2025). doi:10.48550/arXiv.2502.15441

  25. [26]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79. doi:10.1145/3695988

  26. [27]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation.arXiv preprint arXiv:2406.00515(2024). doi:10.1145/3747588

  27. [28]

    Tsutomu Kobayashi and Fuyuki Ishikawa. 2024. Repairing Event-B Models Through Quantifier Elimination. In International Conference on Formal Engineering Methods. Springer, 18–36. doi:10.1007/978-981-96-0617-7_2

  28. [29]

    Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. 2024. LeanAgent: Lifelong Learning for Formal Theorem Proving.arXiv preprint arXiv:2410.06209(2024). doi:10.48550/arXiv. 2410.06209

  29. [30]

    Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. 2024. Lean-star: Learning to interleave thinking and proving. arXiv preprint arXiv:2407.10040(2024). doi:10.48550/arXiv.2407.10040

  30. [31]

    Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520. doi:10.1145/3691620. 3695521

  31. [32]

    Atif Mashkoor, Faqing Yang, and Jean-Pierre Jacquot. 2017. Refinement-based validation of Event-B specifications. Software & Systems Modeling16, 3 (2017), 789–808. doi:10.1007/s10270-016-0514-4

  32. [33]

    Dominique Méry and Neeraj Kumar Singh. 2011. Automatic code generation from Event-B models. InProceedings of the 2nd Symposium on Information and Communication Technology. 179–188. doi:10.1145/2069216.2069252

  33. [34]

    Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. 2023. Magnushammer: A transformer-based approach to premise selection.arXiv preprint arXiv:2303.04488(2023). doi:10.48550/arXiv.2303.04488

  34. [35]

    Yutaka Nagashima and Yilun He. 2018. PaMpeR: proof method recommendation system for Isabelle/HOL. InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 362–372. doi:10.1145/3238147.3238210

  35. [36]

    Yutaka Nagashima and Ramana Kumar. 2017. A proof strategy language and proof script generation for Isabelle/HOL. InInternational Conference on Automated Deduction. Springer, 528–545. doi:10.1007/978-3-319-63046-5_32

  36. [37]

    OpenAI. 2025. GPT-5. https://openai.com/gpt-5/ Released: 2025-08-07

  37. [38]

    Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. 2025. Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025). doi:10.48550/arXiv.2503.02143

  38. [39]

    Victor Rivera, Néstor Catano, Tim Wahls, and Camilo Rueda. 2017. Code generation for Event-B.International Journal on Software Tools for Technology Transfer19, 1 (2017), 31–52. doi:10.1007/s10009-015-0381-2

  39. [40]

    Peter Riviere, Neeraj Kumar Singh, Yamine Aït-Ameur, and Guillaume Dupont. 2023. Formalising liveness properties in Event-B with the reflexive EB4EB framework. InNASA Formal Methods Symposium. Springer, 312–331. doi:10.1007/978- 3-031-33170-1_19

  40. [41]

    Peter Riviere, Neeraj Kumar Singh, Yamine Ait-Ameur, and Guillaume Dupont. 2025. Extending the EB4EB framework with parameterised events.Science of Computer Programming243 (2025), 103279. doi:10.1016/j.scico.2025.103279

  41. [42]

    Ameesh Shah, Marcell Vazquez-Chanlatte, Sebastian Junges, and Sanjit A Seshia. 2023. Learning formal specifications from membership and preference queries.arXiv preprint arXiv:2307.10434(2023). doi:10.48550/arXiv.2307.10434

  42. [43]

    Jun Sun, Yang Liu, Jin Song Dong, and Jun Pang. 2009. PAT: Towards flexible verification under fairness. InInternational conference on computer aided verification. Springer, 709–714. doi:10.1007/978-3-642-02658-4_59

  43. [44]

    Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. 2024. Proving Theorems Recursively. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc.,...

  44. [45]

    Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchanting program specification synthesis by large language models using static analysis and program verification. InInternational Conference on Computer Aided Verification. Springer, 302–328. doi:10.1007/978-3-031- 65630-9_16 Proc. ACM So...

  45. [46]

    Haoze Wu, Clark Barrett, and Nina Narodytska. 2023. Lemur: Integrating large language models in automated program verification.arXiv preprint arXiv:2310.04870(2023). doi:10.48550/arXiv.2310.04870

  46. [47]

    Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, and Sanjit A Seshia. 2025. Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning.arXiv preprint arXiv:2503.05042(2025). doi:10.48550/arXiv.2503.05042

  47. [48]

    Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associ...

  48. [49]

    Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, and Jin Song Dong. 2025. Position: Trustworthy AI Agents Require the Integration of Large Language Models and Formal Methods. InForty-second International Conference on Machine Learning Position Paper Track. https://openreview.net...

  49. [50]

    Xinyue Zuo, Yifan Zhang, Hongshu Wang, Yufan Cai, Zhe Hou, Jing Sun, and Jin Song Dong. 2025. PAT-Agent: Autoformalization for Model Checking. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2122–2133. doi:10.48550/arXiv.2509.23675 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE,...