Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair

Hongshu Wang; Jin Song Dong; Qin Li; Xinyue Zuo; Yamine Ait Ameur; Yuhan Sun

arxiv: 2605.17475 · v1 · pith:JERJI66Unew · submitted 2026-05-17 · 💻 cs.SE

Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair

Hongshu Wang , Xinyue Zuo , Yuhan Sun , Qin Li , Yamine Ait Ameur , Jin Song Dong This is my paper

Pith reviewed 2026-05-19 22:48 UTC · model grok-4.3

classification 💻 cs.SE

keywords Event-BLLM agentformal model synthesismodel repairformal methodsverification feedbackautoformalizationcorrect-by-construction

0 comments

The pith

An LLM agent builds correct Event-B models from natural language by iteratively refining and repairing them with verification feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Event-B Agent as a framework that starts with natural language requirements and produces a formal Event-B model through repeated cycles of construction, refinement, and repair. Formal verification supplies error messages that the LLM uses to fix flaws and simplify proofs at each step. Refinement and repair support each other so that models become easier to verify while staying sound. A reader would care because this process lowers the expertise barrier that currently limits formal methods to specialists. The reported evaluations indicate the agent completes synthesis and repair tasks more effectively than prior methods across models of different sizes.

Core claim

Event-B Agent constructs an initial model from natural language requirements and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency.

What carries the argument

The interleaved loop of model construction, refinement, and repair in which the LLM translates verification error messages into updates that keep the model correct.

If this is right

Formal models can be developed with less manual mathematical intervention while still guaranteeing correctness through verification.
Refinement steps become more tractable because repair maintains soundness at every iteration.
The combined process supports end-to-end automation from requirements to verified model across systems of different scales.
Efficiency remains practical because the agent avoids exhaustive search by focusing on targeted fixes from feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback-driven loop could be tested on other formalisms such as TLA+ to check whether the repair-refinement interaction transfers.
Adding optional human review only at refinement milestones might increase reliability for safety-critical applications without losing most automation gains.
Once a verified Event-B model exists, downstream code generation tools could be applied directly to produce executable implementations.

Load-bearing premise

The LLM can reliably translate formal verification error messages into sound repairs and refinements that preserve model correctness without introducing new inconsistencies or requiring human correction at each step.

What would settle it

Apply the agent to a set of requirements whose verification produces errors that demand non-local structural changes; failure to reach a verified model without external intervention would show the translation step does not hold.

Figures

Figures reproduced from arXiv: 2605.17475 by Hongshu Wang, Jin Song Dong, Qin Li, Xinyue Zuo, Yamine Ait Ameur, Yuhan Sun.

**Figure 2.** Figure 2: The abstract and concrete formal models constructed by Event-B Agent for the motivating example. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The grammar of Event-B, where 𝐹𝑉 (𝑝𝑟𝑒𝑑) is the set of free variables in 𝑝𝑟𝑒𝑑. We position our work relative to two lines of research. LLM proof assistants assume a fixed model 𝑀 and search for valid 𝜋, but cannot construct 𝑀0 ⇝ 𝑀1 from natural language requirements or revise 𝑀𝑡 during development. In contrast, LLM-based autoformalization methods iteratively construct models 𝑀0 ⇝ · · · ⇝ 𝑀𝑘 without proof-ba… view at source ↗

**Figure 4.** Figure 4: Overview of Event-B Agent. From natural-language requirements ( [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Workflow of the model & proof repair component. The model is [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Evolution of the three metrics across refinement steps [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

read the original abstract

Building software that is correct by construction is a long-standing goal in software engineering, as it ensures reliability during design and development rather than after deployment. Formal methods realize this vision by enabling the expression of system behavior and requirements in mathematics, thereby guaranteeing correctness through formal verification, including theorem proving and model checking. However, the steep learning curve and demand for mathematical expertise hinder the widespread adoption of formal methods. Large language models (LLMs) have recently shown promise in bridging this gap through autoformalization. However, existing LLM-based approaches are largely limited to isolated tasks, such as theorem proving without formalization or model synthesis with insufficient verification. While valuable, these efforts do not fully exploit the potential of a more comprehensive framework in which models and proofs evolve together, a process that closely reflects real-world development practice. To address this gap, we propose Event-B Agent, a novel framework inspired by the interleaved nature of software design. Given natural language requirements, Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency. These results suggest that Event-B Agent is a promising step toward correct-by-construction formal model synthesis and repair.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces an LLM agent that loops synthesis, verification-driven repair, and refinement for Event-B models, going beyond single-task formal methods work.

read the letter

This paper introduces an LLM agent that loops synthesis, verification-driven repair, and refinement for Event-B models, going beyond single-task formal methods work. The setup starts from natural language requirements, produces an initial model, then uses prover feedback to fix issues and simplify proofs in alternating steps. Refinement and repair are meant to support each other so the model improves without losing soundness at each stage. That integrated loop is the main new element compared to earlier papers that handled synthesis or proving in isolation. The description of the agent architecture is clear enough to show how the pieces connect and why the authors think it matches real development practice. The abstract also notes that the approach scales across systems of different sizes while staying reasonably efficient, which is a practical point worth checking. The evaluation claims substantial gains over baselines, but the abstract gives no success rates, no definition of the baselines, and no breakdown of where repairs succeeded or failed. Without those details it is difficult to tell how much the results depend on the test cases chosen. The central assumption is that the LLM can convert error messages into repairs that stay faithful to the original requirements rather than just satisfying the prover. If that mapping sometimes produces models that pass verification but drift from intent, the end-to-end advantage could be narrower than stated. This work is aimed at researchers exploring AI assistance for formal methods and at engineers who want lower barriers to using Event-B on safety-critical code. Readers already familiar with Event-B or LLM prompting for verification will get the most from it. The architecture is new enough and the adoption problem is real enough that the paper deserves a serious referee rather than a desk reject. I would send it for review and ask specifically for the missing metrics, baseline details, and any cases where repairs preserved or altered the original requirements.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Event-B Agent, an LLM-based agent framework for synthesizing initial Event-B models from natural language requirements and iteratively refining and repairing them using feedback from formal verification. Refinement simplifies proof discharge while repair ensures soundness at each step; the two are claimed to reinforce each other. Evaluation on systems of varying complexity is reported to show substantial outperformance over baselines in end-to-end synthesis and repair while maintaining reasonable efficiency.

Significance. If the results hold, the work offers a practical step toward lowering the barrier to formal methods by automating the interleaved synthesis-repair-refinement cycle that mirrors real development practice. The explicit use of prover feedback to drive LLM actions distinguishes it from prior isolated autoformalization efforts and could support more accessible correct-by-construction modeling.

major comments (2)

[§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.
[§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.

minor comments (2)

[Abstract] Abstract: 'reasonable efficiency' is stated without any quantitative bound (e.g., average iterations or wall-clock time); a short clarification would help readers interpret the efficiency claim.
[§3 (Framework)] Notation: the distinction between 'refinement' and 'repair' steps is introduced early but the precise interface between the two (e.g., which artifacts are passed to the LLM) is not summarized in a single diagram or pseudocode listing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate where revisions will be made to improve the manuscript.

read point-by-point responses

Referee: [§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.

Authors: We appreciate this observation and agree that additional quantitative detail would strengthen the evaluation. In the revised manuscript we will expand §5 to report concrete success rates for Event-B Agent and all baselines, provide explicit definitions and implementation details for each baseline, state the number of trials run per system, and include an error analysis categorizing failure modes. These additions will allow readers to assess robustness directly. revision: yes
Referee: [§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.

Authors: This is a fair point. The framework encodes natural-language requirements into the initial Event-B model (invariants, events, and proof obligations) and uses verification feedback to drive repairs that preserve those invariants. However, we acknowledge that the manuscript does not explicitly describe an additional safeguard to re-validate alignment with the original requirements after each LLM-generated repair. In the revision we will add a dedicated paragraph in §4.3 clarifying this assumption, discussing its implications, and outlining a lightweight post-repair validation step that could be incorporated in future extensions. revision: partial

Circularity Check

0 steps flagged

No significant circularity in framework proposal or evaluation

full rationale

The paper introduces Event-B Agent as a novel LLM-driven iterative framework for synthesizing and repairing Event-B models from natural-language requirements, using verification feedback for refinement and repair. The central claims rest on an empirical evaluation across systems of varying complexity that shows outperformance over baselines, with no equations, fitted parameters, uniqueness theorems, or self-citations invoked as load-bearing premises. The derivation chain is self-contained: the framework is defined by construction, and the reported performance is presented as an independent empirical result rather than a statistical or definitional consequence of prior inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unproven assumption that current LLMs can act as reliable repair agents when given formal feedback; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption LLMs can be prompted to generate and iteratively repair formal Event-B models from natural language and verification feedback
This capability is presupposed by the agent design but is not derived or proven in the abstract.

invented entities (1)

Event-B Agent no independent evidence
purpose: Iterative synthesis and repair of Event-B models
New named framework introduced by the paper; no independent falsifiable evidence supplied.

pith-pipeline@v0.9.0 · 5809 in / 1228 out tokens · 32841 ms · 2026-05-19T22:48:38.951116+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors

[1]

Jean-Raymond Abrial. 1996. The B-book: Assigning Programs to Meaning.Cambridge University Press(1996). doi:10.1017/CBO9780511624162

work page doi:10.1017/cbo9780511624162 1996
[2]

2010.Modeling in Event-B: system and software engineering

Jean-Raymond Abrial. 2010.Modeling in Event-B: system and software engineering. Cambridge University Press. doi:10.1017/CBO9781139195881

work page doi:10.1017/cbo9781139195881 2010
[3]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[4]

Eman Alkhammash, Michael Butler, Asieh Salehi Fathabadi, and Corina Cîrstea. 2015. Building traceable Event-B models from requirements.Science of Computer Programming111 (2015), 318–338. doi:10.1016/j.scico.2015.06.002

work page doi:10.1016/j.scico.2015.06.002 2015
[5]

Anysphere, Inc. 2025. Cursor: The AI Code Editor. https://cursor.com/ Accessed: 2025-09-11

work page 2025
[6]

Clark Barrett, Christopher L Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. cvc4. InInternational Conference on Computer Aided Verification. Springer, 171–177. doi:10. 1007/978-3-642-22110-1_14

work page 2011
[7]

Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277. doi:10.1007/978-3- 030-53518-6_17

work page doi:10.1007/978-3- 2020
[8]

Cheng-Hao Cai, Jing Sun, and Gillian Dobbie. 2019. Automatic B-model repair using model checking and machine learning.Automated Software Engineering26, 3 (2019), 653–704. doi:10.1007/s10515-019-00264-4

work page doi:10.1007/s10515-019-00264-4 2019
[9]

Cheng-Hao Cai, Jing Sun, Gillian Dobbie, Zhé Hóu, Hadrien Bride, Jin Song Dong, and Scott Uk-Jin Lee. 2022. Fast automated abstract machine repair using simultaneous modifications and refactoring.Formal Aspects of Computing34, 2 (2022), 1–31. doi:10.1145/3536430

work page doi:10.1145/3536430 2022
[10]

Alfredo Capozucca, Daniil Yampolskyi, Alexander Goldberg, and Maximiliano Cristiá. 2025. Do ai assistants help students write formal specifications? a study with chatgpt and the b-method. In2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 19–29. doi:10.1109/CSEET66350.2025.00009

work page doi:10.1109/cseet66350.2025.00009 2025
[11]

Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: proof navigation in Python in the era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641. doi:10.1145/3663529.3663814

work page doi:10.1145/3663529.3663814 2024
[12]

Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, and Yisong Yue. 2021. Neurosymbolic programming.Foundations and Trends in Programming Languages7, 3 (2021), 158–243. doi:10.1561/ 2500000049

work page 2021
[13]

Mark Chevallier, Filip Smola, Richard Schmoetten, and Jacques D Fleuriot. 2025. Formally Verified Neurosymbolic Trajectory Learning via Tensor-based Linear Temporal Logic on Finite Traces.arXiv preprint arXiv:2501.13712(2025). doi:10.48550/arXiv.2501.13712

work page doi:10.48550/arxiv.2501.13712 2025
[14]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively translating unstructured natural language to temporal logics with large language models. InInternational Conference on Computer Aided Verification. Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

work page doi:10.1007/978-3-031-37703-7_18 2023
[15]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340. doi:10.1007/978-3-540-78800-3_24

work page doi:10.1007/978-3-540-78800-3_24 2008
[16]

Guillaume Dupont, Yamine Ait-Ameur, Neeraj Kumar Singh, and Marc Pantel. 2021. Event-B hybridation: A proof and refinement-based framework for modelling hybrid systems.ACM Transactions on Embedded Computing Systems (TECS)20, 4 (2021), 1–37. doi:10.1145/3448270

work page doi:10.1145/3448270 2021
[17]

Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. doi:10.1145/3611643.3616243

work page doi:10.1145/3611643.3616243 2023
[18]

Francesco Fuggitti and Tathagata Chakraborti. 2023. NL2LTL–a python package for converting natural language (NL) instructions to linear temporal logic (LTL) formulas. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 16428–16430. doi:10.1609/aaai.v37i13.27068

work page doi:10.1609/aaai.v37i13.27068 2023
[20]

Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish. 2021. TacticToe: learning to prove with tactics.Journal of Automated Reasoning65, 2 (2021), 257–286. doi:10.1007/s10817-020-09580-x

work page doi:10.1007/s10817-020-09580-x 2021
[21]

Chuqin Geng, Zhaoyue Wang, Haolin Ye, and Xujie Si. 2024. Learning Minimal Neural Specifications.arXiv preprint arXiv:2404.04662(2024). doi:10.48550/arXiv.2404.04662

work page doi:10.48550/arxiv.2404.04662 2024
[22]

Fabian Gloeckle, Baptiste Roziere, Amaury Hayat, and Gabriel Synnaeve. 2023. Temperature-scaled large language models for Lean proofstep prediction. InThe 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23. https: Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE211. Publication date: July 2026. FSE211:22 Hongshu Wang, Xinyue Zuo, Yuhan Sun, Qi...

work page 2023
[23]

Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. ProB Animator and Model Checker. https://prob.hhu.de/ Accessed: 2025-09-11

work page 2025
[24]

Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. Rodin User’s Handbook v.2.8. https://stups.hhu-hosting.de/handbook/rodin/current/html/introduction.html Accessed: 2025-09-11

work page 2025
[25]

Yang Hong, Shan Jiang, Yulei Fu, and Sarfraz Khurshid. 2025. On the Effectiveness of Large Language Models in Writing Alloy Formulas.arXiv preprint arXiv:2502.15441(2025). doi:10.48550/arXiv.2502.15441

work page doi:10.48550/arxiv.2502.15441 2025
[26]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79. doi:10.1145/3695988

work page doi:10.1145/3695988 2024
[27]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation.arXiv preprint arXiv:2406.00515(2024). doi:10.1145/3747588

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3747588 2024
[28]

Tsutomu Kobayashi and Fuyuki Ishikawa. 2024. Repairing Event-B Models Through Quantifier Elimination. In International Conference on Formal Engineering Methods. Springer, 18–36. doi:10.1007/978-981-96-0617-7_2

work page doi:10.1007/978-981-96-0617-7_2 2024
[29]

Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. 2024. LeanAgent: Lifelong Learning for Formal Theorem Proving.arXiv preprint arXiv:2410.06209(2024). doi:10.48550/arXiv. 2410.06209

work page internal anchor Pith review doi:10.48550/arxiv 2024
[30]

Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. 2024. Lean-star: Learning to interleave thinking and proving. arXiv preprint arXiv:2407.10040(2024). doi:10.48550/arXiv.2407.10040

work page doi:10.48550/arxiv.2407.10040 2024
[31]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520. doi:10.1145/3691620. 3695521

work page doi:10.1145/3691620 2024
[32]

Atif Mashkoor, Faqing Yang, and Jean-Pierre Jacquot. 2017. Refinement-based validation of Event-B specifications. Software & Systems Modeling16, 3 (2017), 789–808. doi:10.1007/s10270-016-0514-4

work page doi:10.1007/s10270-016-0514-4 2017
[33]

Dominique Méry and Neeraj Kumar Singh. 2011. Automatic code generation from Event-B models. InProceedings of the 2nd Symposium on Information and Communication Technology. 179–188. doi:10.1145/2069216.2069252

work page doi:10.1145/2069216.2069252 2011
[34]

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. 2023. Magnushammer: A transformer-based approach to premise selection.arXiv preprint arXiv:2303.04488(2023). doi:10.48550/arXiv.2303.04488

work page doi:10.48550/arxiv.2303.04488 2023
[35]

Yutaka Nagashima and Yilun He. 2018. PaMpeR: proof method recommendation system for Isabelle/HOL. InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 362–372. doi:10.1145/3238147.3238210

work page doi:10.1145/3238147.3238210 2018
[36]

Yutaka Nagashima and Ramana Kumar. 2017. A proof strategy language and proof script generation for Isabelle/HOL. InInternational Conference on Automated Deduction. Springer, 528–545. doi:10.1007/978-3-319-63046-5_32

work page doi:10.1007/978-3-319-63046-5_32 2017
[37]

OpenAI. 2025. GPT-5. https://openai.com/gpt-5/ Released: 2025-08-07

work page 2025
[38]

Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. 2025. Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025). doi:10.48550/arXiv.2503.02143

work page doi:10.48550/arxiv.2503.02143 2025
[39]

Victor Rivera, Néstor Catano, Tim Wahls, and Camilo Rueda. 2017. Code generation for Event-B.International Journal on Software Tools for Technology Transfer19, 1 (2017), 31–52. doi:10.1007/s10009-015-0381-2

work page doi:10.1007/s10009-015-0381-2 2017
[40]

Peter Riviere, Neeraj Kumar Singh, Yamine Aït-Ameur, and Guillaume Dupont. 2023. Formalising liveness properties in Event-B with the reflexive EB4EB framework. InNASA Formal Methods Symposium. Springer, 312–331. doi:10.1007/978- 3-031-33170-1_19

work page doi:10.1007/978- 2023
[41]

Peter Riviere, Neeraj Kumar Singh, Yamine Ait-Ameur, and Guillaume Dupont. 2025. Extending the EB4EB framework with parameterised events.Science of Computer Programming243 (2025), 103279. doi:10.1016/j.scico.2025.103279

work page doi:10.1016/j.scico.2025.103279 2025
[42]

Ameesh Shah, Marcell Vazquez-Chanlatte, Sebastian Junges, and Sanjit A Seshia. 2023. Learning formal specifications from membership and preference queries.arXiv preprint arXiv:2307.10434(2023). doi:10.48550/arXiv.2307.10434

work page doi:10.48550/arxiv.2307.10434 2023
[43]

Jun Sun, Yang Liu, Jin Song Dong, and Jun Pang. 2009. PAT: Towards flexible verification under fairness. InInternational conference on computer aided verification. Springer, 709–714. doi:10.1007/978-3-642-02658-4_59

work page doi:10.1007/978-3-642-02658-4_59 2009
[44]

Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. 2024. Proving Theorems Recursively. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc.,...

work page doi:10.52202/079017-2753 2024
[45]

Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchanting program specification synthesis by large language models using static analysis and program verification. InInternational Conference on Computer Aided Verification. Springer, 302–328. doi:10.1007/978-3-031- 65630-9_16 Proc. ACM So...

work page doi:10.1007/978-3-031- 2024
[46]

Haoze Wu, Clark Barrett, and Nina Narodytska. 2023. Lemur: Integrating large language models in automated program verification.arXiv preprint arXiv:2310.04870(2023). doi:10.48550/arXiv.2310.04870

work page doi:10.48550/arxiv.2310.04870 2023
[47]

Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, and Sanjit A Seshia. 2025. Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning.arXiv preprint arXiv:2503.05042(2025). doi:10.48550/arXiv.2503.05042

work page doi:10.48550/arxiv.2503.05042 2025
[48]

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associ...

work page 2023
[49]

Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, and Jin Song Dong. 2025. Position: Trustworthy AI Agents Require the Integration of Large Language Models and Formal Methods. InForty-second International Conference on Machine Learning Position Paper Track. https://openreview.net...

work page 2025
[50]

Xinyue Zuo, Yifan Zhang, Hongshu Wang, Yufan Cai, Zhe Hou, Jing Sun, and Jin Song Dong. 2025. PAT-Agent: Autoformalization for Model Checking. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2122–2133. doi:10.48550/arXiv.2509.23675 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE,...

work page doi:10.48550/arxiv.2509.23675 2025

[1] [1]

Jean-Raymond Abrial. 1996. The B-book: Assigning Programs to Meaning.Cambridge University Press(1996). doi:10.1017/CBO9780511624162

work page doi:10.1017/cbo9780511624162 1996

[2] [2]

2010.Modeling in Event-B: system and software engineering

Jean-Raymond Abrial. 2010.Modeling in Event-B: system and software engineering. Cambridge University Press. doi:10.1017/CBO9781139195881

work page doi:10.1017/cbo9781139195881 2010

[3] [3]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[4] [4]

Eman Alkhammash, Michael Butler, Asieh Salehi Fathabadi, and Corina Cîrstea. 2015. Building traceable Event-B models from requirements.Science of Computer Programming111 (2015), 318–338. doi:10.1016/j.scico.2015.06.002

work page doi:10.1016/j.scico.2015.06.002 2015

[5] [5]

Anysphere, Inc. 2025. Cursor: The AI Code Editor. https://cursor.com/ Accessed: 2025-09-11

work page 2025

[6] [6]

Clark Barrett, Christopher L Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. cvc4. InInternational Conference on Computer Aided Verification. Springer, 171–177. doi:10. 1007/978-3-642-22110-1_14

work page 2011

[7] [7]

Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277. doi:10.1007/978-3- 030-53518-6_17

work page doi:10.1007/978-3- 2020

[8] [8]

Cheng-Hao Cai, Jing Sun, and Gillian Dobbie. 2019. Automatic B-model repair using model checking and machine learning.Automated Software Engineering26, 3 (2019), 653–704. doi:10.1007/s10515-019-00264-4

work page doi:10.1007/s10515-019-00264-4 2019

[9] [9]

Cheng-Hao Cai, Jing Sun, Gillian Dobbie, Zhé Hóu, Hadrien Bride, Jin Song Dong, and Scott Uk-Jin Lee. 2022. Fast automated abstract machine repair using simultaneous modifications and refactoring.Formal Aspects of Computing34, 2 (2022), 1–31. doi:10.1145/3536430

work page doi:10.1145/3536430 2022

[10] [10]

Alfredo Capozucca, Daniil Yampolskyi, Alexander Goldberg, and Maximiliano Cristiá. 2025. Do ai assistants help students write formal specifications? a study with chatgpt and the b-method. In2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 19–29. doi:10.1109/CSEET66350.2025.00009

work page doi:10.1109/cseet66350.2025.00009 2025

[11] [11]

Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: proof navigation in Python in the era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641. doi:10.1145/3663529.3663814

work page doi:10.1145/3663529.3663814 2024

[12] [12]

Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, and Yisong Yue. 2021. Neurosymbolic programming.Foundations and Trends in Programming Languages7, 3 (2021), 158–243. doi:10.1561/ 2500000049

work page 2021

[13] [13]

Mark Chevallier, Filip Smola, Richard Schmoetten, and Jacques D Fleuriot. 2025. Formally Verified Neurosymbolic Trajectory Learning via Tensor-based Linear Temporal Logic on Finite Traces.arXiv preprint arXiv:2501.13712(2025). doi:10.48550/arXiv.2501.13712

work page doi:10.48550/arxiv.2501.13712 2025

[14] [14]

Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively translating unstructured natural language to temporal logics with large language models. InInternational Conference on Computer Aided Verification. Springer, 383–396. doi:10.1007/978-3-031-37703-7_18

work page doi:10.1007/978-3-031-37703-7_18 2023

[15] [15]

Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340. doi:10.1007/978-3-540-78800-3_24

work page doi:10.1007/978-3-540-78800-3_24 2008

[16] [16]

Guillaume Dupont, Yamine Ait-Ameur, Neeraj Kumar Singh, and Marc Pantel. 2021. Event-B hybridation: A proof and refinement-based framework for modelling hybrid systems.ACM Transactions on Embedded Computing Systems (TECS)20, 4 (2021), 1–37. doi:10.1145/3448270

work page doi:10.1145/3448270 2021

[17] [17]

Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. doi:10.1145/3611643.3616243

work page doi:10.1145/3611643.3616243 2023

[18] [18]

Francesco Fuggitti and Tathagata Chakraborti. 2023. NL2LTL–a python package for converting natural language (NL) instructions to linear temporal logic (LTL) formulas. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 16428–16430. doi:10.1609/aaai.v37i13.27068

work page doi:10.1609/aaai.v37i13.27068 2023

[19] [20]

Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish. 2021. TacticToe: learning to prove with tactics.Journal of Automated Reasoning65, 2 (2021), 257–286. doi:10.1007/s10817-020-09580-x

work page doi:10.1007/s10817-020-09580-x 2021

[20] [21]

Chuqin Geng, Zhaoyue Wang, Haolin Ye, and Xujie Si. 2024. Learning Minimal Neural Specifications.arXiv preprint arXiv:2404.04662(2024). doi:10.48550/arXiv.2404.04662

work page doi:10.48550/arxiv.2404.04662 2024

[21] [22]

Fabian Gloeckle, Baptiste Roziere, Amaury Hayat, and Gabriel Synnaeve. 2023. Temperature-scaled large language models for Lean proofstep prediction. InThe 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23. https: Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE211. Publication date: July 2026. FSE211:22 Hongshu Wang, Xinyue Zuo, Yuhan Sun, Qi...

work page 2023

[22] [23]

Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. ProB Animator and Model Checker. https://prob.hhu.de/ Accessed: 2025-09-11

work page 2025

[23] [24]

Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. Rodin User’s Handbook v.2.8. https://stups.hhu-hosting.de/handbook/rodin/current/html/introduction.html Accessed: 2025-09-11

work page 2025

[24] [25]

Yang Hong, Shan Jiang, Yulei Fu, and Sarfraz Khurshid. 2025. On the Effectiveness of Large Language Models in Writing Alloy Formulas.arXiv preprint arXiv:2502.15441(2025). doi:10.48550/arXiv.2502.15441

work page doi:10.48550/arxiv.2502.15441 2025

[25] [26]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79. doi:10.1145/3695988

work page doi:10.1145/3695988 2024

[26] [27]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation.arXiv preprint arXiv:2406.00515(2024). doi:10.1145/3747588

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3747588 2024

[27] [28]

Tsutomu Kobayashi and Fuyuki Ishikawa. 2024. Repairing Event-B Models Through Quantifier Elimination. In International Conference on Formal Engineering Methods. Springer, 18–36. doi:10.1007/978-981-96-0617-7_2

work page doi:10.1007/978-981-96-0617-7_2 2024

[28] [29]

Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. 2024. LeanAgent: Lifelong Learning for Formal Theorem Proving.arXiv preprint arXiv:2410.06209(2024). doi:10.48550/arXiv. 2410.06209

work page internal anchor Pith review doi:10.48550/arxiv 2024

[29] [30]

Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. 2024. Lean-star: Learning to interleave thinking and proving. arXiv preprint arXiv:2407.10040(2024). doi:10.48550/arXiv.2407.10040

work page doi:10.48550/arxiv.2407.10040 2024

[30] [31]

Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520. doi:10.1145/3691620. 3695521

work page doi:10.1145/3691620 2024

[31] [32]

Atif Mashkoor, Faqing Yang, and Jean-Pierre Jacquot. 2017. Refinement-based validation of Event-B specifications. Software & Systems Modeling16, 3 (2017), 789–808. doi:10.1007/s10270-016-0514-4

work page doi:10.1007/s10270-016-0514-4 2017

[32] [33]

Dominique Méry and Neeraj Kumar Singh. 2011. Automatic code generation from Event-B models. InProceedings of the 2nd Symposium on Information and Communication Technology. 179–188. doi:10.1145/2069216.2069252

work page doi:10.1145/2069216.2069252 2011

[33] [34]

Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. 2023. Magnushammer: A transformer-based approach to premise selection.arXiv preprint arXiv:2303.04488(2023). doi:10.48550/arXiv.2303.04488

work page doi:10.48550/arxiv.2303.04488 2023

[34] [35]

Yutaka Nagashima and Yilun He. 2018. PaMpeR: proof method recommendation system for Isabelle/HOL. InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 362–372. doi:10.1145/3238147.3238210

work page doi:10.1145/3238147.3238210 2018

[35] [36]

Yutaka Nagashima and Ramana Kumar. 2017. A proof strategy language and proof script generation for Isabelle/HOL. InInternational Conference on Automated Deduction. Springer, 528–545. doi:10.1007/978-3-319-63046-5_32

work page doi:10.1007/978-3-319-63046-5_32 2017

[36] [37]

OpenAI. 2025. GPT-5. https://openai.com/gpt-5/ Released: 2025-08-07

work page 2025

[37] [38]

Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. 2025. Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025). doi:10.48550/arXiv.2503.02143

work page doi:10.48550/arxiv.2503.02143 2025

[38] [39]

Victor Rivera, Néstor Catano, Tim Wahls, and Camilo Rueda. 2017. Code generation for Event-B.International Journal on Software Tools for Technology Transfer19, 1 (2017), 31–52. doi:10.1007/s10009-015-0381-2

work page doi:10.1007/s10009-015-0381-2 2017

[39] [40]

Peter Riviere, Neeraj Kumar Singh, Yamine Aït-Ameur, and Guillaume Dupont. 2023. Formalising liveness properties in Event-B with the reflexive EB4EB framework. InNASA Formal Methods Symposium. Springer, 312–331. doi:10.1007/978- 3-031-33170-1_19

work page doi:10.1007/978- 2023

[40] [41]

Peter Riviere, Neeraj Kumar Singh, Yamine Ait-Ameur, and Guillaume Dupont. 2025. Extending the EB4EB framework with parameterised events.Science of Computer Programming243 (2025), 103279. doi:10.1016/j.scico.2025.103279

work page doi:10.1016/j.scico.2025.103279 2025

[41] [42]

Ameesh Shah, Marcell Vazquez-Chanlatte, Sebastian Junges, and Sanjit A Seshia. 2023. Learning formal specifications from membership and preference queries.arXiv preprint arXiv:2307.10434(2023). doi:10.48550/arXiv.2307.10434

work page doi:10.48550/arxiv.2307.10434 2023

[42] [43]

Jun Sun, Yang Liu, Jin Song Dong, and Jun Pang. 2009. PAT: Towards flexible verification under fairness. InInternational conference on computer aided verification. Springer, 709–714. doi:10.1007/978-3-642-02658-4_59

work page doi:10.1007/978-3-642-02658-4_59 2009

[43] [44]

Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. 2024. Proving Theorems Recursively. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc.,...

work page doi:10.52202/079017-2753 2024

[44] [45]

Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchanting program specification synthesis by large language models using static analysis and program verification. InInternational Conference on Computer Aided Verification. Springer, 302–328. doi:10.1007/978-3-031- 65630-9_16 Proc. ACM So...

work page doi:10.1007/978-3-031- 2024

[45] [46]

Haoze Wu, Clark Barrett, and Nina Narodytska. 2023. Lemur: Integrating large language models in automated program verification.arXiv preprint arXiv:2310.04870(2023). doi:10.48550/arXiv.2310.04870

work page doi:10.48550/arxiv.2310.04870 2023

[46] [47]

Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, and Sanjit A Seshia. 2025. Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning.arXiv preprint arXiv:2503.05042(2025). doi:10.48550/arXiv.2503.05042

work page doi:10.48550/arxiv.2503.05042 2025

[47] [48]

Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associ...

work page 2023

[48] [49]

Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, and Jin Song Dong. 2025. Position: Trustworthy AI Agents Require the Integration of Large Language Models and Formal Methods. InForty-second International Conference on Machine Learning Position Paper Track. https://openreview.net...

work page 2025

[49] [50]

Xinyue Zuo, Yifan Zhang, Hongshu Wang, Yufan Cai, Zhe Hou, Jing Sun, and Jin Song Dong. 2025. PAT-Agent: Autoformalization for Model Checking. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2122–2133. doi:10.48550/arXiv.2509.23675 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE,...

work page doi:10.48550/arxiv.2509.23675 2025