Formal-Method-Guided Vibe Coding: Closing the Verification Loop on AI-Generated Safety-Critical Software Through Model-Driven Engineering
Pith reviewed 2026-06-26 10:06 UTC · model grok-4.3
The pith
A closed-loop pipeline extracts formal models from LLM-generated Java and feeds verification failures back as prompts until certifiable evidence is produced.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Forge pipeline uses model transformations to map LLM-generated Java into three complementary formalisms (Dafny, CSP and Z-Machines), runs the corresponding verifiers, and converts every failure into a correction prompt for the next LLM iteration; the loop continues until all verifiers succeed, at which point the generated Java is accompanied by standards-relevant verification evidence.
What carries the argument
The Forge closed-loop pipeline that performs model transformations from Java to Dafny, CSP and Z-Machines and routes verification counter-examples back into LLM prompts.
If this is right
- Java code produced by vibe coding can be accompanied by formal evidence acceptable to certification standards.
- Developers interact only with natural language and ordinary Java, never with the formal models themselves.
- Three distinct verification techniques provide cross-checks that a single formalism would miss.
- Each verification failure automatically supplies the next correction prompt, driving iterative improvement without manual intervention.
Where Pith is reading between the lines
- The same transformation-plus-feedback pattern could be applied to other industrial languages if equivalent model mappings exist.
- Certification authorities might accept the generated evidence packages only after independent audit of the transformation soundness.
- The approach implicitly assumes that the verifiers themselves are trusted; any soundness gap in Dafny, FDR4 or Isabelle would propagate to the final claim.
Load-bearing premise
The model transformations must correctly and completely preserve the semantics of the generated Java in all three formalisms.
What would settle it
A Java program that passes all three verifiers after transformation but still contains a safety violation (for example an array bounds error) that none of the formal models captured.
Figures
read the original abstract
Vibe coding -- accepting LLM-generated source from natural-language intent with minimal review -- is fast and may be adequate for low-criticality consumer software. But for safety-critical systems governed by DO-178C, IEC 61508, or ISO 26262, it offers no path to certification: large language models (LLMs) provide no formal correctness guarantees, and existing remedies target verification-aware languages (Dafny, Verus, Lean) that are scarce in pretraining data and absent from industrial toolchains. This paper closes the gap. We present Forge (Formal method Oriented Refinement loop for GEnerated code): a closed-loop pipeline that guides vibe coding through formal verification using established Model-Driven Engineering (MDE) infrastructure. Through vibe coding, we generate Java source code; our pipeline then extracts -- via model transformations -- formal artefacts in three different formalisms, each checked by a complementary verifier: deductive verification (Dafny), Communicating Sequential Processes (CSP) refinement via the Failures-Divergences Refinement checker (FDR4), and theorem proving using Z-Machines in Isabelle; every verification failure becomes a structured correction prompt that drives the next code-generation iteration. The LLM is the draft generator, the MDE chain is the discriminator, and the developer never has to read the formal models. Empirically, we find that the pipeline produces standards-relevant verification evidence for LLM-generated Java -- a step toward certification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Forge, a closed-loop pipeline that generates Java code via LLM 'vibe coding' from natural-language intent, then applies Model-Driven Engineering transformations to extract formal models in Dafny, CSP, and Z-Machines. These models are verified independently with Dafny (deductive verification), FDR4 (CSP refinement), and Isabelle (Z theorem proving); verification failures are converted into structured prompts to drive LLM iterations. The central claim is that this approach empirically produces standards-relevant verification evidence for LLM-generated Java, providing a path toward certification under DO-178C, IEC 61508, or ISO 26262 without requiring developers to work directly in formal languages.
Significance. If the transformations are semantics-preserving and the empirical results are reproducible, the work could meaningfully connect informal LLM-generated code with the formal artifacts required for safety-critical certification. The multi-formalism approach (deductive, process-algebraic, and theorem-proving) is a positive design choice that leverages complementary strengths of existing tools.
major comments (2)
- [Abstract] Abstract and pipeline description: the claim that the pipeline 'empirically' produces 'standards-relevant verification evidence' for LLM-generated Java is unsupported because the manuscript supplies no experimental design, case-study count, success metrics, iteration statistics, or baseline comparisons. Without these, it is impossible to evaluate whether the verification outcomes actually support certification claims.
- [Forge pipeline description] Forge pipeline / MDE transformations section: the Java-to-Dafny, Java-to-CSP, and Java-to-Z transformations are presented as reliable extractors whose outputs can be trusted for certification evidence, yet no soundness argument, semantic-preservation proof, equivalence checker, or differential-testing validation is supplied. This is load-bearing: any discrepancy in exception semantics, integer overflow, or concurrency model means the Dafny/FDR4/Isabelle results apply only to the transformed artifact, not the original Java.
minor comments (1)
- [Introduction] The term 'vibe coding' is used repeatedly without a precise definition or citation to the prior literature on LLM code generation from natural language.
Simulated Author's Rebuttal
We thank the referee for the constructive and precise comments. We address each major point below, indicating where revisions are required and where we maintain the original framing.
read point-by-point responses
-
Referee: [Abstract] Abstract and pipeline description: the claim that the pipeline 'empirically' produces 'standards-relevant verification evidence' for LLM-generated Java is unsupported because the manuscript supplies no experimental design, case-study count, success metrics, iteration statistics, or baseline comparisons. Without these, it is impossible to evaluate whether the verification outcomes actually support certification claims.
Authors: The referee is correct that the current text asserts an empirical outcome without presenting the supporting experimental protocol, case counts, metrics, or baselines. The manuscript's contribution is primarily the pipeline architecture and the closed-loop integration of MDE with LLM generation; the empirical claim was intended as a high-level summary of internal validation runs rather than a completed evaluation study. We will revise the abstract to remove the unqualified empirical claim and add a new Evaluation section that reports the experimental design, number of case studies, success rates, iteration statistics, and any baseline comparisons performed. revision: yes
-
Referee: [Forge pipeline description] Forge pipeline / MDE transformations section: the Java-to-Dafny, Java-to-CSP, and Java-to-Z transformations are presented as reliable extractors whose outputs can be trusted for certification evidence, yet no soundness argument, semantic-preservation proof, equivalence checker, or differential-testing validation is supplied. This is load-bearing: any discrepancy in exception semantics, integer overflow, or concurrency model means the Dafny/FDR4/Isabelle results apply only to the transformed artifact, not the original Java.
Authors: We agree that the absence of an explicit soundness argument for the three transformations is a substantive gap for any certification-related claim. The transformations are implemented using established MDE frameworks (ATL and QVT) whose metamodels are derived from the Java language specification, but the manuscript does not supply a dedicated preservation proof or differential-testing harness. We will add a subsection that (a) states the semantic-preservation assumptions inherited from the chosen MDE tooling, (b) identifies the specific Java features (e.g., exception handling, integer overflow, concurrency) for which preservation is currently unproven, and (c) outlines planned validation steps. We do not claim that the current transformations are already certified; the revision will make this limitation explicit. revision: yes
Circularity Check
No significant circularity; pipeline relies on external independent verifiers
full rationale
The paper describes a closed-loop pipeline in which an LLM generates Java, MDE transformations produce models in Dafny/CSP/Z, and independent tools (Dafny, FDR4, Isabelle) perform verification whose failures are fed back as prompts. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claim—that the pipeline yields standards-relevant verification evidence—rests on the outputs of those external verifiers rather than any self-referential definition or reduction of results to inputs by construction. The soundness of the transformations is an external assumption, not a circular step internal to the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Model transformations from Java to Dafny, CSP, and Z-Machines preserve the relevant semantics for verification purposes.
invented entities (1)
-
Forge pipeline
no independent evidence
Reference graph
Works this paper leans on
-
[1]
2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel
Martin Armbruster. 2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel. Technical Report. Karlsruhe Institute of Technology. doi:10.5445/IR/1000149186
-
[2]
Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. 2005. Boogie: A Modular Reusable Verifier for Object- Oriented Programs. InFormal Methods for Components and Objects (FMCO 2005) (Lecture Notes in Computer Science, Vol. 4111). Springer, 364–387. doi:10.1007/11804192_17
-
[3]
James Baxter, Pedro Ribeiro, and Ana Cavalcanti. 2022. Sound Reasoning in tock-CSP.Acta Informatica59 (2022), 125–162
2022
-
[4]
Artur Boronat and Jawad Mustafa. 2025. MDRE-LLM: A Tool for Analyzing and Applying LLMs in Software Reverse Engineering. InSANER. IEEE, 850–854
2025
-
[5]
Hugo Bruneliere, Jordi Cabot, Grégoire Dupé, and Frédéric Madiot. 2014. MoDisco: A Model Driven Reverse Engineering Framework.Information and Software Technology56, 8 (2014), 1012–1032
2014
-
[6]
Victor Campanello, Shariq Shahbaz, Vladislav Indykov, and Daniel Strüber. 2025. On the use of GPT-4 in the reverse engineering of class diagrams. Journal of Object Technology24, 2 (2025), 1–14
2025
-
[7]
Javier Luis Cánovas Izquierdo and Jesús García Molina. 2014. Extracting Models from Source Code in Software Modernization.Software and Systems Modeling13, 2 (2014), 713–734
2014
-
[8]
Krzysztof Czarnecki and Simon Helsen. 2006. Feature-Based Survey of Model Transformation Approaches.IBM Systems Journal45, 3 (2006), 621–645. doi:10.1147/sj.453.0621
-
[9]
Rafael S Durelli, Daniel SM Santibánez, Bruno Marinho, Raphael Honda, Márcio E Delamaro, Nicolas Anquetil, and Valter Vieira de Camargo
-
[10]
InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014)
A mapping study on architecture-driven modernization. InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014). IEEE, 577–584
2014
-
[11]
Eclipse Epsilon Project. 2024. Epsilon EMC JDT Driver: On-Demand Java Model Connectivity via Eclipse JDT. https://eclipse.dev/epsilon/doc/ articles/jdt/ Accessed: 2025-12-01
2024
-
[12]
Eclipse Foundation. 2024. Acceleo: M2T Code Generation. https://eclipse.dev/acceleo/. Implementing the OMG MOF Model to Text Transformation Language (MOFM2T) standard
2024
-
[13]
Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53
2023
-
[14]
Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe. 2025. Vibe Coding in Practice: Motivations, Challenges, and a Future Outlook–a Grey Literature Review.arXiv preprint arXiv:2510.00328(2025)
arXiv 2025
-
[15]
Simon Foster, Yakoub Nemouchi, Mario Gleirscher, Ran Wei, and Tim Kelly. 2021. Integration of Formal Proof into Unified Assurance Cases with Isabelle/SACM.Formal Aspects of Computing33, 6 (2021), 855–884
2021
-
[16]
Simon Foster, Yakoub Nemouchi, Colin O’Halloran, Karen Stephenson, and Nick Tudor. 2020. Formal Model-Based Assurance Cases in Isabelle/SACM: An Autonomous Underwater Vehicle Case Study. InFormaliSE 2020. ACM, 11–21
2020
-
[17]
Thomas Gibson-Robinson, Philip Armstrong, Alexandre Boulgakov, and Andrew W Roscoe. 2014. FDR3: A Modern Refinement Checker for CSP. In TACAS (LNCS, Vol. 8413). Springer, 187–201
2014
-
[18]
Florian Heidenreich, Jendrik Johannes, Mirko Seifert, and Christian Wende. 2010. Closing the Gap between Modelling and Java. InSoftware Language Engineering (SLE 2009) (LNCS, Vol. 5969). Springer, 374–383. doi:10.1007/978-3-642-12107-4_25
-
[19]
Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation.arXiv preprint arXiv:2312.13010(2023)
Pith/arXiv arXiv 2023
-
[20]
2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems
IEC. 2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. Technical Report. International Electrotechnical Commission
2010
-
[21]
ISO. 2018. Road Vehicles – Functional Safety
2018
-
[22]
Java Community Process. 2021. JSR-302: Safety-Critical Java Technology. Java Specification Request, public review draft. Started 2006; under JCP standardisation. https://jcp.org/en/jsr/detail?id=302
2021
-
[23]
Andrej Karpathy. 2025. Vibe Coding. X (formerly Twitter). https://x.com/karpathy/status/1886192184808149383
arXiv 2025
-
[24]
Tim Kelly. 2004. A Systematic Approach to Safety Case Management. InSAE 2004 World Congress & Exhibition. SAE Technical Paper
2004
-
[25]
Dimitrios S Kolovos, Richard F Paige, and Fiona AC Polack. 2008. The Epsilon Transformation Language. InICMT 2008 (LNCS, Vol. 5063). Springer, 46–60
2008
-
[26]
Jagun Kwon, Andy Wellings, and Steve King. 2005. Ravenscar-Java: A high-integrity profile for real-time Java.Concurrency and Computation: Practice and Experience17, 5–6 (2005), 681–713. doi:10.1002/cpe.843 Originally presented in 2002; journal version 2005
-
[27]
Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying rust programs using linear ghost types.Proceedings of the ACM on Programming Languages7, OOPSLA1 (2023), 286–315
2023
-
[28]
K Rustan M Leino. 2010. Dafny: An automatic program verifier for functional correctness. InInternational conference on logic for programming artificial intelligence and reasoning. Springer, 348–370. Manuscript submitted to ACM 22 Ran Wei, Le Zhu, Haochi Wang, Jim Woodcock, Fang Yan, Simon Foster, and Xiangyang Ji
2010
-
[29]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems36 (2023), 21558–21572
2023
-
[30]
Dennis, Clare Dixon, and Michael Fisher
Matt Luckcuck, Marie Farrell, Louise A. Dennis, Clare Dixon, and Michael Fisher. 2019. Formal Specification and Verification of Autonomous Robotic Systems: A Survey.Comput. Surveys52, 5 (2019), 100:1–100:41. doi:10.1145/3342355
-
[31]
Tom Mens and Pieter Van Gorp. 2006. A Taxonomy of Model Transformation.Electronic Notes in Theoretical Computer Science152 (2006), 125–142. doi:10.1016/j.entcs.2005.10.021
-
[32]
MISRA. 2012. MISRA C:2012: Guidelines for the Use of the C Language in Critical Systems
2012
-
[33]
Md Rakib Hossain Misu, Cristina V Lopes, Iris Ma, and James Noble. 2024. Towards ai-assisted synthesis of verified dafny methods.Proceedings of the ACM on Software Engineering1, FSE (2024), 812–835
2024
-
[34]
Alvaro Miyazawa, Pedro Ribeiro, Wei Li, Ana Cavalcanti, Jon Timmis, and Jim Woodcock. 2019. RoboChart: Modelling and Verification of the Functional Behaviour of Robotic Applications.Software and Systems Modeling18, 5 (2019), 3097–3149
2019
-
[35]
Leonardo de Moura and Sebastian Ullrich. 2021. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction. Springer, 625–635
2021
-
[36]
Prasita Mukherjee, Minghai Lu, and Benjamin Delaware. 2025. LLM-Assisted Synthesis of High-Assurance C Programs. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3108
2025
-
[37]
Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. 2002.Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Lecture Notes in Computer Science, Vol. 2283. Springer. doi:10.1007/3-540-45949-9
-
[38]
2015.XML Metadata Interchange (XMI) Specification
Object Management Group. 2015.XML Metadata Interchange (XMI) Specification. Technical Report Version 2.5.1. OMG. https://www.omg.org/spec/ XMI/
2015
-
[39]
2020.Structured Assurance Case Metamodel Specification
Object Management Group. 2020.Structured Assurance Case Metamodel Specification. Technical Report. OMG. https://www.omg.org/spec/SACM/
2020
-
[40]
Renaud Pawlak, Martin Monperrus, Nicolas Petitprez, Carlos Noguera, and Lionel Seinturier. 2015. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code.Software: Practice and Experience46, 9 (2015), 1155–1179. doi:10.1002/spe.2346
-
[41]
Veronica Pimenova, Sarah Fakhoury, Christian Bird, Margaret-Anne Storey, and Madeline Endres. 2025. Good Vibrations? A Qualitative Study of Co-Creation, Communication, Flow, and Trust in Vibe Coding. arXiv:2509.12491 [cs.SE] https://arxiv.org/abs/2509.12491
arXiv 2025
-
[42]
Louis M. Rose, Richard F. Paige, Dimitrios S. Kolovos, and Fiona A. C. Polack. 2008. The Epsilon Generation Language. InModel Driven Architecture– Foundations and Applications (ECMDA-FA 2008) (Lecture Notes in Computer Science, Vol. 5095). Springer, 1–16. doi:10.1007/978-3-540-69100-6_1
-
[43]
2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification
RTCA. 2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification. Technical Report. RTCA Inc
2011
-
[44]
2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A
RTCA. 2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A. RTCA, Inc. / EUROCAE (as ED-217), Washington, DC. Published December 2011
2011
-
[45]
Shane Sendall and Wojtek Kozaczynski. 2003. Model Transformation: The Heart and Soul of Model-Driven Software Development.IEEE Software20, 5 (2003), 42–45. doi:10.1109/MS.2003.1231150
-
[46]
Hanan Siala and Kevin Lano. 2025. A comparison of large language models and model-driven reverse engineering for reverse engineering.Frontiers in Computer Science7 (2025), 1516410
2025
-
[47]
2008.EMF: Eclipse Modeling Framework(2nd ed.)
Dave Steinberg, Frank Budinsky, Marcelo Paternostro, and Ed Merks. 2008.EMF: Eclipse Modeling Framework(2nd ed.). Addison-Wesley
2008
-
[48]
Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. 2024. Clover: Closed-loop verifiable code generation. InInternational Symposium on AI Verification. Springer, 134–155
2024
-
[49]
Florian Tambon, Arghavan Moradi-Dakhel, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Giuliano Antoniol. 2025. Bugs in large language models generated code: An empirical study.Empirical Software Engineering30, 3 (2025), 65
2025
-
[50]
Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri
-
[51]
CLEVER: A Curated Benchmark for Formally Verified Code Generation.Advances in Neural Information Processing Systems38 (2026)
2026
-
[52]
The HIJA Consortium. 2007. High-Integrity Java Applications (HIJA): A new programming standard for safety–critical embedded systems. European Commission IST project IST-2001-35256 final report. Project ran 2004–2007
2007
-
[53]
Ran Wei, Simon Foster, Haitao Mei, Fang Yan, Ruizhe Yang, Ibrahim Habli, Colin O’Halloran, Nick Tudor, Tim Kelly, and Yakoub Nemouchi. 2024. ACCESS: Assurance case centric engineering of safety–critical systems.Journal of Systems and Software213 (2024), 112034
2024
-
[54]
Ran Wei, Tim P Kelly, Xiaotian Dai, Shuai Zhao, and Richard Hawkins. 2019. Model Based System Assurance Using the Structured Assurance Case Metamodel.Journal of Systems and Software154 (2019), 211–233
2019
-
[55]
Jim Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. 2009. Formal Methods: Practice and Experience.Comput. Surveys41, 4 (2009), 19:1–19:36. doi:10.1145/1592434.1592436
-
[56]
Awadallah, Ryen W
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. In COLM
2024
-
[57]
Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. 2025. Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification.arXiv preprint arXiv:2509.23061(2025)
arXiv 2025
-
[58]
Fang Yan, Simon Foster, and Ibrahim Habli. 2023. Automated Compositional Verification for Robotic State Machines Using Isabelle/HOL. InICECCS. IEEE, 167–176. Manuscript submitted to ACM
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.