Event-B Agent: Towards LLM Agent for Formal Model Synthesis and Repair
Pith reviewed 2026-05-19 22:48 UTC · model grok-4.3
The pith
An LLM agent builds correct Event-B models from natural language by iteratively refining and repairing them with verification feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Event-B Agent constructs an initial model from natural language requirements and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency.
What carries the argument
The interleaved loop of model construction, refinement, and repair in which the LLM translates verification error messages into updates that keep the model correct.
If this is right
- Formal models can be developed with less manual mathematical intervention while still guaranteeing correctness through verification.
- Refinement steps become more tractable because repair maintains soundness at every iteration.
- The combined process supports end-to-end automation from requirements to verified model across systems of different scales.
- Efficiency remains practical because the agent avoids exhaustive search by focusing on targeted fixes from feedback.
Where Pith is reading between the lines
- The same feedback-driven loop could be tested on other formalisms such as TLA+ to check whether the repair-refinement interaction transfers.
- Adding optional human review only at refinement milestones might increase reliability for safety-critical applications without losing most automation gains.
- Once a verified Event-B model exists, downstream code generation tools could be applied directly to produce executable implementations.
Load-bearing premise
The LLM can reliably translate formal verification error messages into sound repairs and refinements that preserve model correctness without introducing new inconsistencies or requiring human correction at each step.
What would settle it
Apply the agent to a set of requirements whose verification produces errors that demand non-local structural changes; failure to reach a verified model without external intervention would show the translation step does not hold.
Figures
read the original abstract
Building software that is correct by construction is a long-standing goal in software engineering, as it ensures reliability during design and development rather than after deployment. Formal methods realize this vision by enabling the expression of system behavior and requirements in mathematics, thereby guaranteeing correctness through formal verification, including theorem proving and model checking. However, the steep learning curve and demand for mathematical expertise hinder the widespread adoption of formal methods. Large language models (LLMs) have recently shown promise in bridging this gap through autoformalization. However, existing LLM-based approaches are largely limited to isolated tasks, such as theorem proving without formalization or model synthesis with insufficient verification. While valuable, these efforts do not fully exploit the potential of a more comprehensive framework in which models and proofs evolve together, a process that closely reflects real-world development practice. To address this gap, we propose Event-B Agent, a novel framework inspired by the interleaved nature of software design. Given natural language requirements, Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step. Together, these two components reinforce each other to progressively improve the model quality. Evaluation across systems of varying complexity demonstrates that Event-B Agent substantially outperforms baselines in end-to-end formal model synthesis and repair, while maintaining reasonable efficiency. These results suggest that Event-B Agent is a promising step toward correct-by-construction formal model synthesis and repair.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Event-B Agent, an LLM-based agent framework for synthesizing initial Event-B models from natural language requirements and iteratively refining and repairing them using feedback from formal verification. Refinement simplifies proof discharge while repair ensures soundness at each step; the two are claimed to reinforce each other. Evaluation on systems of varying complexity is reported to show substantial outperformance over baselines in end-to-end synthesis and repair while maintaining reasonable efficiency.
Significance. If the results hold, the work offers a practical step toward lowering the barrier to formal methods by automating the interleaved synthesis-repair-refinement cycle that mirrors real development practice. The explicit use of prover feedback to drive LLM actions distinguishes it from prior isolated autoformalization efforts and could support more accessible correct-by-construction modeling.
major comments (2)
- [§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.
- [§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.
minor comments (2)
- [Abstract] Abstract: 'reasonable efficiency' is stated without any quantitative bound (e.g., average iterations or wall-clock time); a short clarification would help readers interpret the efficiency claim.
- [§3 (Framework)] Notation: the distinction between 'refinement' and 'repair' steps is introduced early but the precise interface between the two (e.g., which artifacts are passed to the LLM) is not summarized in a single diagram or pseudocode listing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications and indicate where revisions will be made to improve the manuscript.
read point-by-point responses
-
Referee: [§5 (Evaluation)] §5 (Evaluation): the claim of 'substantially outperforms baselines' is presented without concrete success rates, baseline definitions, number of trials, or error analysis in the reported results. This makes it impossible to judge whether the outperformance is robust or an artifact of the chosen test cases.
Authors: We appreciate this observation and agree that additional quantitative detail would strengthen the evaluation. In the revised manuscript we will expand §5 to report concrete success rates for Event-B Agent and all baselines, provide explicit definitions and implementation details for each baseline, state the number of trials run per system, and include an error analysis categorizing failure modes. These additions will allow readers to assess robustness directly. revision: yes
-
Referee: [§4.3 (Repair loop)] §4.3 (Repair loop): formal verification guarantees only that the repaired model satisfies its invariants; the manuscript does not describe any check that LLM-generated repairs continue to satisfy the original natural-language requirements. Without such a safeguard the autonomous end-to-end claim rests on an unverified assumption about LLM fidelity.
Authors: This is a fair point. The framework encodes natural-language requirements into the initial Event-B model (invariants, events, and proof obligations) and uses verification feedback to drive repairs that preserve those invariants. However, we acknowledge that the manuscript does not explicitly describe an additional safeguard to re-validate alignment with the original requirements after each LLM-generated repair. In the revision we will add a dedicated paragraph in §4.3 clarifying this assumption, discussing its implications, and outlining a lightweight post-repair validation step that could be incorporated in future extensions. revision: partial
Circularity Check
No significant circularity in framework proposal or evaluation
full rationale
The paper introduces Event-B Agent as a novel LLM-driven iterative framework for synthesizing and repairing Event-B models from natural-language requirements, using verification feedback for refinement and repair. The central claims rest on an empirical evaluation across systems of varying complexity that shows outperformance over baselines, with no equations, fitted parameters, uniqueness theorems, or self-citations invoked as load-bearing premises. The derivation chain is self-contained: the framework is defined by construction, and the reported performance is presented as an independent empirical result rather than a statistical or definitional consequence of prior inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLMs can be prompted to generate and iteratively repair formal Event-B models from natural language and verification feedback
invented entities (1)
-
Event-B Agent
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Event-B Agent constructs an initial model and iteratively repairs and refines it using formal verification feedback. Refinement simplifies proof discharge, while repair of models and proofs ensures the soundness of each refinement step.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Jean-Raymond Abrial. 1996. The B-book: Assigning Programs to Meaning.Cambridge University Press(1996). doi:10.1017/CBO9780511624162
-
[2]
2010.Modeling in Event-B: system and software engineering
Jean-Raymond Abrial. 2010.Modeling in Event-B: system and software engineering. Cambridge University Press. doi:10.1017/CBO9781139195881
-
[3]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.arXiv preprint arXiv:2303.08774 (2023). doi:10.48550/arXiv.2303.08774
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
-
[4]
Eman Alkhammash, Michael Butler, Asieh Salehi Fathabadi, and Corina Cîrstea. 2015. Building traceable Event-B models from requirements.Science of Computer Programming111 (2015), 318–338. doi:10.1016/j.scico.2015.06.002
-
[5]
Anysphere, Inc. 2025. Cursor: The AI Code Editor. https://cursor.com/ Accessed: 2025-09-11
work page 2025
-
[6]
Clark Barrett, Christopher L Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. cvc4. InInternational Conference on Computer Aided Verification. Springer, 171–177. doi:10. 1007/978-3-642-22110-1_14
work page 2011
-
[7]
Lasse Blaauwbroek, Josef Urban, and Herman Geuvers. 2020. The tactician: A seamless, interactive tactic learner and prover for coq. InInternational Conference on Intelligent Computer Mathematics. Springer, 271–277. doi:10.1007/978-3- 030-53518-6_17
-
[8]
Cheng-Hao Cai, Jing Sun, and Gillian Dobbie. 2019. Automatic B-model repair using model checking and machine learning.Automated Software Engineering26, 3 (2019), 653–704. doi:10.1007/s10515-019-00264-4
-
[9]
Cheng-Hao Cai, Jing Sun, Gillian Dobbie, Zhé Hóu, Hadrien Bride, Jin Song Dong, and Scott Uk-Jin Lee. 2022. Fast automated abstract machine repair using simultaneous modifications and refactoring.Formal Aspects of Computing34, 2 (2022), 1–31. doi:10.1145/3536430
-
[10]
Alfredo Capozucca, Daniil Yampolskyi, Alexander Goldberg, and Maximiliano Cristiá. 2025. Do ai assistants help students write formal specifications? a study with chatgpt and the b-method. In2025 IEEE/ACM 37th International Conference on Software Engineering Education and Training (CSEE&T). IEEE, 19–29. doi:10.1109/CSEET66350.2025.00009
-
[11]
Pedro Carrott, Nuno Saavedra, Kyle Thompson, Sorin Lerner, João F Ferreira, and Emily First. 2024. CoqPyt: proof navigation in Python in the era of LLMs. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 637–641. doi:10.1145/3663529.3663814
-
[12]
Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar-Lezama, and Yisong Yue. 2021. Neurosymbolic programming.Foundations and Trends in Programming Languages7, 3 (2021), 158–243. doi:10.1561/ 2500000049
work page 2021
-
[13]
Mark Chevallier, Filip Smola, Richard Schmoetten, and Jacques D Fleuriot. 2025. Formally Verified Neurosymbolic Trajectory Learning via Tensor-based Linear Temporal Logic on Finite Traces.arXiv preprint arXiv:2501.13712(2025). doi:10.48550/arXiv.2501.13712
-
[14]
Matthias Cosler, Christopher Hahn, Daniel Mendoza, Frederik Schmitt, and Caroline Trippel. 2023. nl2spec: Interactively translating unstructured natural language to temporal logics with large language models. InInternational Conference on Computer Aided Verification. Springer, 383–396. doi:10.1007/978-3-031-37703-7_18
-
[15]
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. InInternational conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340. doi:10.1007/978-3-540-78800-3_24
-
[16]
Guillaume Dupont, Yamine Ait-Ameur, Neeraj Kumar Singh, and Marc Pantel. 2021. Event-B hybridation: A proof and refinement-based framework for modelling hybrid systems.ACM Transactions on Embedded Computing Systems (TECS)20, 4 (2021), 1–37. doi:10.1145/3448270
-
[17]
Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1229–1241. doi:10.1145/3611643.3616243
-
[18]
Francesco Fuggitti and Tathagata Chakraborti. 2023. NL2LTL–a python package for converting natural language (NL) instructions to linear temporal logic (LTL) formulas. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 16428–16430. doi:10.1609/aaai.v37i13.27068
-
[20]
Thibault Gauthier, Cezary Kaliszyk, Josef Urban, Ramana Kumar, and Michael Norrish. 2021. TacticToe: learning to prove with tactics.Journal of Automated Reasoning65, 2 (2021), 257–286. doi:10.1007/s10817-020-09580-x
-
[21]
Chuqin Geng, Zhaoyue Wang, Haolin Ye, and Xujie Si. 2024. Learning Minimal Neural Specifications.arXiv preprint arXiv:2404.04662(2024). doi:10.48550/arXiv.2404.04662
-
[22]
Fabian Gloeckle, Baptiste Roziere, Amaury Hayat, and Gabriel Synnaeve. 2023. Temperature-scaled large language models for Lean proofstep prediction. InThe 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23. https: Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE211. Publication date: July 2026. FSE211:22 Hongshu Wang, Xinyue Zuo, Yuhan Sun, Qi...
work page 2023
-
[23]
Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. ProB Animator and Model Checker. https://prob.hhu.de/ Accessed: 2025-09-11
work page 2025
-
[24]
Heinrich Heine University Düsseldorf, Group for Software Engineering and Programming Languages. 2025. Rodin User’s Handbook v.2.8. https://stups.hhu-hosting.de/handbook/rodin/current/html/introduction.html Accessed: 2025-09-11
work page 2025
-
[25]
Yang Hong, Shan Jiang, Yulei Fu, and Sarfraz Khurshid. 2025. On the Effectiveness of Large Language Models in Writing Alloy Formulas.arXiv preprint arXiv:2502.15441(2025). doi:10.48550/arXiv.2502.15441
-
[26]
Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79. doi:10.1145/3695988
-
[27]
Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation.arXiv preprint arXiv:2406.00515(2024). doi:10.1145/3747588
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3747588 2024
-
[28]
Tsutomu Kobayashi and Fuyuki Ishikawa. 2024. Repairing Event-B Models Through Quantifier Elimination. In International Conference on Formal Engineering Methods. Springer, 18–36. doi:10.1007/978-981-96-0617-7_2
-
[29]
Adarsh Kumarappan, Mo Tiwari, Peiyang Song, Robert Joseph George, Chaowei Xiao, and Anima Anandkumar. 2024. LeanAgent: Lifelong Learning for Formal Theorem Proving.arXiv preprint arXiv:2410.06209(2024). doi:10.48550/arXiv. 2410.06209
work page internal anchor Pith review doi:10.48550/arxiv 2024
-
[30]
Haohan Lin, Zhiqing Sun, Sean Welleck, and Yiming Yang. 2024. Lean-star: Learning to interleave thinking and proving. arXiv preprint arXiv:2407.10040(2024). doi:10.48550/arXiv.2407.10040
-
[31]
Minghai Lu, Benjamin Delaware, and Tianyi Zhang. 2024. Proof automation with large language models. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 1509–1520. doi:10.1145/3691620. 3695521
-
[32]
Atif Mashkoor, Faqing Yang, and Jean-Pierre Jacquot. 2017. Refinement-based validation of Event-B specifications. Software & Systems Modeling16, 3 (2017), 789–808. doi:10.1007/s10270-016-0514-4
-
[33]
Dominique Méry and Neeraj Kumar Singh. 2011. Automatic code generation from Event-B models. InProceedings of the 2nd Symposium on Information and Communication Technology. 179–188. doi:10.1145/2069216.2069252
-
[34]
Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. 2023. Magnushammer: A transformer-based approach to premise selection.arXiv preprint arXiv:2303.04488(2023). doi:10.48550/arXiv.2303.04488
-
[35]
Yutaka Nagashima and Yilun He. 2018. PaMpeR: proof method recommendation system for Isabelle/HOL. InProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 362–372. doi:10.1145/3238147.3238210
-
[36]
Yutaka Nagashima and Ramana Kumar. 2017. A proof strategy language and proof script generation for Isabelle/HOL. InInternational Conference on Automated Deduction. Springer, 528–545. doi:10.1007/978-3-319-63046-5_32
-
[37]
OpenAI. 2025. GPT-5. https://openai.com/gpt-5/ Released: 2025-08-07
work page 2025
-
[38]
Jordan Peper, Zhenjiang Mao, Yuang Geng, Siyuan Pan, and Ivan Ruchkin. 2025. Four Principles for Physically Interpretable World Models.arXiv preprint arXiv:2503.02143(2025). doi:10.48550/arXiv.2503.02143
-
[39]
Victor Rivera, Néstor Catano, Tim Wahls, and Camilo Rueda. 2017. Code generation for Event-B.International Journal on Software Tools for Technology Transfer19, 1 (2017), 31–52. doi:10.1007/s10009-015-0381-2
-
[40]
Peter Riviere, Neeraj Kumar Singh, Yamine Aït-Ameur, and Guillaume Dupont. 2023. Formalising liveness properties in Event-B with the reflexive EB4EB framework. InNASA Formal Methods Symposium. Springer, 312–331. doi:10.1007/978- 3-031-33170-1_19
-
[41]
Peter Riviere, Neeraj Kumar Singh, Yamine Ait-Ameur, and Guillaume Dupont. 2025. Extending the EB4EB framework with parameterised events.Science of Computer Programming243 (2025), 103279. doi:10.1016/j.scico.2025.103279
-
[42]
Ameesh Shah, Marcell Vazquez-Chanlatte, Sebastian Junges, and Sanjit A Seshia. 2023. Learning formal specifications from membership and preference queries.arXiv preprint arXiv:2307.10434(2023). doi:10.48550/arXiv.2307.10434
-
[43]
Jun Sun, Yang Liu, Jin Song Dong, and Jun Pang. 2009. PAT: Towards flexible verification under fairness. InInternational conference on computer aided verification. Springer, 709–714. doi:10.1007/978-3-642-02658-4_59
-
[44]
Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. 2024. Proving Theorems Recursively. InAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc.,...
-
[45]
Cheng Wen, Jialun Cao, Jie Su, Zhiwu Xu, Shengchao Qin, Mengda He, Haokun Li, Shing-Chi Cheung, and Cong Tian. 2024. Enchanting program specification synthesis by large language models using static analysis and program verification. InInternational Conference on Computer Aided Verification. Springer, 302–328. doi:10.1007/978-3-031- 65630-9_16 Proc. ACM So...
-
[46]
Haoze Wu, Clark Barrett, and Nina Narodytska. 2023. Lemur: Integrating large language models in automated program verification.arXiv preprint arXiv:2310.04870(2023). doi:10.48550/arXiv.2310.04870
-
[47]
Beyazit Yalcinkaya, Niklas Lauffer, Marcell Vazquez-Chanlatte, and Sanjit A Seshia. 2025. Provably Correct Automata Embeddings for Optimal Automata-Conditioned Reinforcement Learning.arXiv preprint arXiv:2503.05042(2025). doi:10.48550/arXiv.2503.05042
-
[48]
Kaiyu Yang, Aidan Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan J Prenger, and Animashree Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associ...
work page 2023
-
[49]
Yedi Zhang, Yufan Cai, Xinyue Zuo, Xiaokun Luan, Kailong Wang, Zhe Hou, Yifan Zhang, Zhiyuan Wei, Meng Sun, Jun Sun, Jing Sun, and Jin Song Dong. 2025. Position: Trustworthy AI Agents Require the Integration of Large Language Models and Formal Methods. InForty-second International Conference on Machine Learning Position Paper Track. https://openreview.net...
work page 2025
-
[50]
Xinyue Zuo, Yifan Zhang, Hongshu Wang, Yufan Cai, Zhe Hou, Jing Sun, and Jin Song Dong. 2025. PAT-Agent: Autoformalization for Model Checking. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2122–2133. doi:10.48550/arXiv.2509.23675 Received 2025-09-12; accepted 2026-03-24 Proc. ACM Softw. Eng., Vol. 3, No. FSE,...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.