Formal-Method-Guided Vibe Coding: Closing the Verification Loop on AI-Generated Safety-Critical Software Through Model-Driven Engineering

Fang Yan; Haochi Wang; Jim Woodcock; Le Zhu; Ran Wei; Simon Foster; Xiangyang Ji

arxiv: 2606.22413 · v2 · pith:WRVJGJA5new · submitted 2026-06-21 · 💻 cs.SE · cs.FL

Formal-Method-Guided Vibe Coding: Closing the Verification Loop on AI-Generated Safety-Critical Software Through Model-Driven Engineering

Ran Wei , Le Zhu , Haochi Wang , Jim Woodcock , Fang Yan , Simon Foster , Xiangyang Ji This is my paper

Pith reviewed 2026-06-26 10:06 UTC · model grok-4.3

classification 💻 cs.SE cs.FL

keywords vibe codingLLM-generated codeformal verificationmodel-driven engineeringsafety-critical softwareDafnyCSP refinementZ notation

0 comments

The pith

A closed-loop pipeline extracts formal models from LLM-generated Java and feeds verification failures back as prompts until certifiable evidence is produced.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Forge, a pipeline that starts with natural-language intent to generate Java via an LLM, then applies model-driven transformations to produce equivalent artefacts in Dafny for deductive verification, CSP for refinement checking with FDR4, and Z-Machines for theorem proving in Isabelle. Failures in any verifier are turned into structured prompts that drive the next generation round, creating an automated loop the developer never needs to enter directly. The result is claimed to be verification artefacts that satisfy the evidence requirements of standards such as DO-178C, IEC 61508 and ISO 26262 for the original Java source. A sympathetic reader sees this as a way to keep the speed of vibe coding while obtaining the formal guarantees that certification bodies require.

Core claim

The Forge pipeline uses model transformations to map LLM-generated Java into three complementary formalisms (Dafny, CSP and Z-Machines), runs the corresponding verifiers, and converts every failure into a correction prompt for the next LLM iteration; the loop continues until all verifiers succeed, at which point the generated Java is accompanied by standards-relevant verification evidence.

What carries the argument

The Forge closed-loop pipeline that performs model transformations from Java to Dafny, CSP and Z-Machines and routes verification counter-examples back into LLM prompts.

If this is right

Java code produced by vibe coding can be accompanied by formal evidence acceptable to certification standards.
Developers interact only with natural language and ordinary Java, never with the formal models themselves.
Three distinct verification techniques provide cross-checks that a single formalism would miss.
Each verification failure automatically supplies the next correction prompt, driving iterative improvement without manual intervention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same transformation-plus-feedback pattern could be applied to other industrial languages if equivalent model mappings exist.
Certification authorities might accept the generated evidence packages only after independent audit of the transformation soundness.
The approach implicitly assumes that the verifiers themselves are trusted; any soundness gap in Dafny, FDR4 or Isabelle would propagate to the final claim.

Load-bearing premise

The model transformations must correctly and completely preserve the semantics of the generated Java in all three formalisms.

What would settle it

A Java program that passes all three verifiers after transformation but still contains a safety violation (for example an array bounds error) that none of the formal models captured.

Figures

Figures reproduced from arXiv: 2606.22413 by Fang Yan, Haochi Wang, Jim Woodcock, Le Zhu, Ran Wei, Simon Foster, Xiangyang Ji.

**Figure 1.** Figure 1: Architecture of Forge. Requirements and generation (Phases 1–2) produces the requirement set and then Java (with the 2a/2b/2c preflight checks). Extraction (Phases 3–5) parses the Java to an EMF model (T2M), transforms it to a RoboChart state machine (M2M), and emits three formal artefacts (M2T): Dafny from the Java EMF (dashed bypass), CSP semantics and Z-Machine notation from the RoboChart model. Verific… view at source ↗

read the original abstract

Vibe coding -- accepting LLM-generated source from natural-language intent with minimal review -- is fast and may be adequate for low-criticality consumer software. But for safety-critical systems governed by DO-178C, IEC 61508, or ISO 26262, it offers no path to certification: large language models (LLMs) provide no formal correctness guarantees, and existing remedies target verification-aware languages (Dafny, Verus, Lean) that are scarce in pretraining data and absent from industrial toolchains. This paper closes the gap. We present Forge (Formal method Oriented Refinement loop for GEnerated code): a closed-loop pipeline that guides vibe coding through formal verification using established Model-Driven Engineering (MDE) infrastructure. Through vibe coding, we generate Java source code; our pipeline then extracts -- via model transformations -- formal artefacts in three different formalisms, each checked by a complementary verifier: deductive verification (Dafny), Communicating Sequential Processes (CSP) refinement via the Failures-Divergences Refinement checker (FDR4), and theorem proving using Z-Machines in Isabelle; every verification failure becomes a structured correction prompt that drives the next code-generation iteration. The LLM is the draft generator, the MDE chain is the discriminator, and the developer never has to read the formal models. Empirically, we find that the pipeline produces standards-relevant verification evidence for LLM-generated Java -- a step toward certification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Forge describes a three-formalism MDE loop to verify LLM Java code but the transformations' soundness is unaddressed and experiments are missing from the abstract.

read the letter

The paper presents Forge, a pipeline that starts with LLM-generated Java from natural-language prompts, extracts formal models via MDE into Dafny, CSP, and Z, runs them through Dafny, FDR4, and Isabelle respectively, and turns verification failures into structured prompts for the next iteration. The developer never sees the formal artifacts.

What is new is the specific combination of vibe coding with MDE extraction across three complementary formalisms and an automated feedback loop aimed at certification standards. It correctly identifies that existing fixes push developers toward verification languages that are rare in practice and pretraining data.

The architecture makes sense on paper: the MDE chain acts as an independent discriminator, and reusing mature tools like FDR4 and Isabelle avoids reinventing verifiers. This could be useful for teams that want to keep Java in the toolchain while still generating some verification evidence.

The soft spots are clear. The abstract claims the pipeline produces standards-relevant evidence but supplies no experimental design, test cases, success rates, iteration counts, or comparisons. More critically, the Java-to-Dafny, Java-to-CSP, and Java-to-Z transformations are treated as reliable black boxes. If any of them drops exception semantics, mishandles concurrency, or alters overflow behavior, the proofs apply only to the extracted models, not the original code. The stress-test concern holds on the material provided; no soundness argument or differential testing is mentioned.

This is for people working at the intersection of AI-assisted development and high-assurance systems. A reader already familiar with MDE and the cited verifiers would get the most from the architecture description.

Send it for peer review. The core idea is coherent enough to deserve referee time, provided the full paper supplies the missing experimental data and addresses transformation validity.

Referee Report

2 major / 1 minor

Summary. The paper presents Forge, a closed-loop pipeline that generates Java code via LLM 'vibe coding' from natural-language intent, then applies Model-Driven Engineering transformations to extract formal models in Dafny, CSP, and Z-Machines. These models are verified independently with Dafny (deductive verification), FDR4 (CSP refinement), and Isabelle (Z theorem proving); verification failures are converted into structured prompts to drive LLM iterations. The central claim is that this approach empirically produces standards-relevant verification evidence for LLM-generated Java, providing a path toward certification under DO-178C, IEC 61508, or ISO 26262 without requiring developers to work directly in formal languages.

Significance. If the transformations are semantics-preserving and the empirical results are reproducible, the work could meaningfully connect informal LLM-generated code with the formal artifacts required for safety-critical certification. The multi-formalism approach (deductive, process-algebraic, and theorem-proving) is a positive design choice that leverages complementary strengths of existing tools.

major comments (2)

[Abstract] Abstract and pipeline description: the claim that the pipeline 'empirically' produces 'standards-relevant verification evidence' for LLM-generated Java is unsupported because the manuscript supplies no experimental design, case-study count, success metrics, iteration statistics, or baseline comparisons. Without these, it is impossible to evaluate whether the verification outcomes actually support certification claims.
[Forge pipeline description] Forge pipeline / MDE transformations section: the Java-to-Dafny, Java-to-CSP, and Java-to-Z transformations are presented as reliable extractors whose outputs can be trusted for certification evidence, yet no soundness argument, semantic-preservation proof, equivalence checker, or differential-testing validation is supplied. This is load-bearing: any discrepancy in exception semantics, integer overflow, or concurrency model means the Dafny/FDR4/Isabelle results apply only to the transformed artifact, not the original Java.

minor comments (1)

[Introduction] The term 'vibe coding' is used repeatedly without a precise definition or citation to the prior literature on LLM code generation from natural language.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. We address each major point below, indicating where revisions are required and where we maintain the original framing.

read point-by-point responses

Referee: [Abstract] Abstract and pipeline description: the claim that the pipeline 'empirically' produces 'standards-relevant verification evidence' for LLM-generated Java is unsupported because the manuscript supplies no experimental design, case-study count, success metrics, iteration statistics, or baseline comparisons. Without these, it is impossible to evaluate whether the verification outcomes actually support certification claims.

Authors: The referee is correct that the current text asserts an empirical outcome without presenting the supporting experimental protocol, case counts, metrics, or baselines. The manuscript's contribution is primarily the pipeline architecture and the closed-loop integration of MDE with LLM generation; the empirical claim was intended as a high-level summary of internal validation runs rather than a completed evaluation study. We will revise the abstract to remove the unqualified empirical claim and add a new Evaluation section that reports the experimental design, number of case studies, success rates, iteration statistics, and any baseline comparisons performed. revision: yes
Referee: [Forge pipeline description] Forge pipeline / MDE transformations section: the Java-to-Dafny, Java-to-CSP, and Java-to-Z transformations are presented as reliable extractors whose outputs can be trusted for certification evidence, yet no soundness argument, semantic-preservation proof, equivalence checker, or differential-testing validation is supplied. This is load-bearing: any discrepancy in exception semantics, integer overflow, or concurrency model means the Dafny/FDR4/Isabelle results apply only to the transformed artifact, not the original Java.

Authors: We agree that the absence of an explicit soundness argument for the three transformations is a substantive gap for any certification-related claim. The transformations are implemented using established MDE frameworks (ATL and QVT) whose metamodels are derived from the Java language specification, but the manuscript does not supply a dedicated preservation proof or differential-testing harness. We will add a subsection that (a) states the semantic-preservation assumptions inherited from the chosen MDE tooling, (b) identifies the specific Java features (e.g., exception handling, integer overflow, concurrency) for which preservation is currently unproven, and (c) outlines planned validation steps. We do not claim that the current transformations are already certified; the revision will make this limitation explicit. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pipeline relies on external independent verifiers

full rationale

The paper describes a closed-loop pipeline in which an LLM generates Java, MDE transformations produce models in Dafny/CSP/Z, and independent tools (Dafny, FDR4, Isabelle) perform verification whose failures are fed back as prompts. No equations, fitted parameters, predictions, or self-citations appear in the provided text. The central claim—that the pipeline yields standards-relevant verification evidence—rests on the outputs of those external verifiers rather than any self-referential definition or reduction of results to inputs by construction. The soundness of the transformations is an external assumption, not a circular step internal to the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available; the ledger is therefore limited to assumptions visible at that level.

axioms (1)

domain assumption Model transformations from Java to Dafny, CSP, and Z-Machines preserve the relevant semantics for verification purposes.
The pipeline depends on these transformations being faithful; any semantic gap would invalidate the verification results.

invented entities (1)

Forge pipeline no independent evidence
purpose: Closed-loop verification system that hides formal models from the developer while producing certification evidence.
New named system introduced to integrate LLM generation with MDE and multiple verifiers.

pith-pipeline@v0.9.1-grok · 5817 in / 1289 out tokens · 26415 ms · 2026-06-26T10:06:09.531694+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 12 canonical work pages

[1]

2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel

Martin Armbruster. 2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel. Technical Report. Karlsruhe Institute of Technology. doi:10.5445/IR/1000149186

work page doi:10.5445/ir/1000149186 2022
[2]

Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. 2005. Boogie: A Modular Reusable Verifier for Object- Oriented Programs. InFormal Methods for Components and Objects (FMCO 2005) (Lecture Notes in Computer Science, Vol. 4111). Springer, 364–387. doi:10.1007/11804192_17

work page doi:10.1007/11804192_17 2005
[3]

James Baxter, Pedro Ribeiro, and Ana Cavalcanti. 2022. Sound Reasoning in tock-CSP.Acta Informatica59 (2022), 125–162

2022
[4]

Artur Boronat and Jawad Mustafa. 2025. MDRE-LLM: A Tool for Analyzing and Applying LLMs in Software Reverse Engineering. InSANER. IEEE, 850–854

2025
[5]

Hugo Bruneliere, Jordi Cabot, Grégoire Dupé, and Frédéric Madiot. 2014. MoDisco: A Model Driven Reverse Engineering Framework.Information and Software Technology56, 8 (2014), 1012–1032

2014
[6]

Victor Campanello, Shariq Shahbaz, Vladislav Indykov, and Daniel Strüber. 2025. On the use of GPT-4 in the reverse engineering of class diagrams. Journal of Object Technology24, 2 (2025), 1–14

2025
[7]

Javier Luis Cánovas Izquierdo and Jesús García Molina. 2014. Extracting Models from Source Code in Software Modernization.Software and Systems Modeling13, 2 (2014), 713–734

2014
[8]

Krzysztof Czarnecki and Simon Helsen. 2006. Feature-Based Survey of Model Transformation Approaches.IBM Systems Journal45, 3 (2006), 621–645. doi:10.1147/sj.453.0621

work page doi:10.1147/sj.453.0621 2006
[9]

Rafael S Durelli, Daniel SM Santibánez, Bruno Marinho, Raphael Honda, Márcio E Delamaro, Nicolas Anquetil, and Valter Vieira de Camargo
[10]

InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014)

A mapping study on architecture-driven modernization. InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014). IEEE, 577–584

2014
[11]

Eclipse Epsilon Project. 2024. Epsilon EMC JDT Driver: On-Demand Java Model Connectivity via Eclipse JDT. https://eclipse.dev/epsilon/doc/ articles/jdt/ Accessed: 2025-12-01

2024
[12]

Eclipse Foundation. 2024. Acceleo: M2T Code Generation. https://eclipse.dev/acceleo/. Implementing the OMG MOF Model to Text Transformation Language (MOFM2T) standard

2024
[13]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53

2023
[14]

Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe. 2025. Vibe Coding in Practice: Motivations, Challenges, and a Future Outlook–a Grey Literature Review.arXiv preprint arXiv:2510.00328(2025)

arXiv 2025
[15]

Simon Foster, Yakoub Nemouchi, Mario Gleirscher, Ran Wei, and Tim Kelly. 2021. Integration of Formal Proof into Unified Assurance Cases with Isabelle/SACM.Formal Aspects of Computing33, 6 (2021), 855–884

2021
[16]

Simon Foster, Yakoub Nemouchi, Colin O’Halloran, Karen Stephenson, and Nick Tudor. 2020. Formal Model-Based Assurance Cases in Isabelle/SACM: An Autonomous Underwater Vehicle Case Study. InFormaliSE 2020. ACM, 11–21

2020
[17]

Thomas Gibson-Robinson, Philip Armstrong, Alexandre Boulgakov, and Andrew W Roscoe. 2014. FDR3: A Modern Refinement Checker for CSP. In TACAS (LNCS, Vol. 8413). Springer, 187–201

2014
[18]

Florian Heidenreich, Jendrik Johannes, Mirko Seifert, and Christian Wende. 2010. Closing the Gap between Modelling and Java. InSoftware Language Engineering (SLE 2009) (LNCS, Vol. 5969). Springer, 374–383. doi:10.1007/978-3-642-12107-4_25

work page doi:10.1007/978-3-642-12107-4_25 2010
[19]

Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation.arXiv preprint arXiv:2312.13010(2023)

Pith/arXiv arXiv 2023
[20]

2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems

IEC. 2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. Technical Report. International Electrotechnical Commission

2010
[21]

ISO. 2018. Road Vehicles – Functional Safety

2018
[22]

Java Community Process. 2021. JSR-302: Safety-Critical Java Technology. Java Specification Request, public review draft. Started 2006; under JCP standardisation. https://jcp.org/en/jsr/detail?id=302

2021
[23]

Andrej Karpathy. 2025. Vibe Coding. X (formerly Twitter). https://x.com/karpathy/status/1886192184808149383

arXiv 2025
[24]

Tim Kelly. 2004. A Systematic Approach to Safety Case Management. InSAE 2004 World Congress & Exhibition. SAE Technical Paper

2004
[25]

Dimitrios S Kolovos, Richard F Paige, and Fiona AC Polack. 2008. The Epsilon Transformation Language. InICMT 2008 (LNCS, Vol. 5063). Springer, 46–60

2008
[26]

Jagun Kwon, Andy Wellings, and Steve King. 2005. Ravenscar-Java: A high-integrity profile for real-time Java.Concurrency and Computation: Practice and Experience17, 5–6 (2005), 681–713. doi:10.1002/cpe.843 Originally presented in 2002; journal version 2005

work page doi:10.1002/cpe.843 2005
[27]

Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying rust programs using linear ghost types.Proceedings of the ACM on Programming Languages7, OOPSLA1 (2023), 286–315

2023
[28]

K Rustan M Leino. 2010. Dafny: An automatic program verifier for functional correctness. InInternational conference on logic for programming artificial intelligence and reasoning. Springer, 348–370. Manuscript submitted to ACM 22 Ran Wei, Le Zhu, Haochi Wang, Jim Woodcock, Fang Yan, Simon Foster, and Xiangyang Ji

2010
[29]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems36 (2023), 21558–21572

2023
[30]

Dennis, Clare Dixon, and Michael Fisher

Matt Luckcuck, Marie Farrell, Louise A. Dennis, Clare Dixon, and Michael Fisher. 2019. Formal Specification and Verification of Autonomous Robotic Systems: A Survey.Comput. Surveys52, 5 (2019), 100:1–100:41. doi:10.1145/3342355

work page doi:10.1145/3342355 2019
[31]

Tom Mens and Pieter Van Gorp. 2006. A Taxonomy of Model Transformation.Electronic Notes in Theoretical Computer Science152 (2006), 125–142. doi:10.1016/j.entcs.2005.10.021

work page doi:10.1016/j.entcs.2005.10.021 2006
[32]

MISRA. 2012. MISRA C:2012: Guidelines for the Use of the C Language in Critical Systems

2012
[33]

Md Rakib Hossain Misu, Cristina V Lopes, Iris Ma, and James Noble. 2024. Towards ai-assisted synthesis of verified dafny methods.Proceedings of the ACM on Software Engineering1, FSE (2024), 812–835

2024
[34]

Alvaro Miyazawa, Pedro Ribeiro, Wei Li, Ana Cavalcanti, Jon Timmis, and Jim Woodcock. 2019. RoboChart: Modelling and Verification of the Functional Behaviour of Robotic Applications.Software and Systems Modeling18, 5 (2019), 3097–3149

2019
[35]

Leonardo de Moura and Sebastian Ullrich. 2021. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction. Springer, 625–635

2021
[36]

Prasita Mukherjee, Minghai Lu, and Benjamin Delaware. 2025. LLM-Assisted Synthesis of High-Assurance C Programs. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3108

2025
[37]

Paulson, and Markus Wenzel

Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. 2002.Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Lecture Notes in Computer Science, Vol. 2283. Springer. doi:10.1007/3-540-45949-9

work page doi:10.1007/3-540-45949-9 2002
[38]

2015.XML Metadata Interchange (XMI) Specification

Object Management Group. 2015.XML Metadata Interchange (XMI) Specification. Technical Report Version 2.5.1. OMG. https://www.omg.org/spec/ XMI/

2015
[39]

2020.Structured Assurance Case Metamodel Specification

Object Management Group. 2020.Structured Assurance Case Metamodel Specification. Technical Report. OMG. https://www.omg.org/spec/SACM/

2020
[40]

Renaud Pawlak, Martin Monperrus, Nicolas Petitprez, Carlos Noguera, and Lionel Seinturier. 2015. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code.Software: Practice and Experience46, 9 (2015), 1155–1179. doi:10.1002/spe.2346

work page doi:10.1002/spe.2346 2015
[41]

Veronica Pimenova, Sarah Fakhoury, Christian Bird, Margaret-Anne Storey, and Madeline Endres. 2025. Good Vibrations? A Qualitative Study of Co-Creation, Communication, Flow, and Trust in Vibe Coding. arXiv:2509.12491 [cs.SE] https://arxiv.org/abs/2509.12491

arXiv 2025
[42]

Rose, Richard F

Louis M. Rose, Richard F. Paige, Dimitrios S. Kolovos, and Fiona A. C. Polack. 2008. The Epsilon Generation Language. InModel Driven Architecture– Foundations and Applications (ECMDA-FA 2008) (Lecture Notes in Computer Science, Vol. 5095). Springer, 1–16. doi:10.1007/978-3-540-69100-6_1

work page doi:10.1007/978-3-540-69100-6_1 2008
[43]

2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification

RTCA. 2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification. Technical Report. RTCA Inc

2011
[44]

2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A

RTCA. 2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A. RTCA, Inc. / EUROCAE (as ED-217), Washington, DC. Published December 2011

2011
[45]

Shane Sendall and Wojtek Kozaczynski. 2003. Model Transformation: The Heart and Soul of Model-Driven Software Development.IEEE Software20, 5 (2003), 42–45. doi:10.1109/MS.2003.1231150

work page doi:10.1109/ms.2003.1231150 2003
[46]

Hanan Siala and Kevin Lano. 2025. A comparison of large language models and model-driven reverse engineering for reverse engineering.Frontiers in Computer Science7 (2025), 1516410

2025
[47]

2008.EMF: Eclipse Modeling Framework(2nd ed.)

Dave Steinberg, Frank Budinsky, Marcelo Paternostro, and Ed Merks. 2008.EMF: Eclipse Modeling Framework(2nd ed.). Addison-Wesley

2008
[48]

Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. 2024. Clover: Closed-loop verifiable code generation. InInternational Symposium on AI Verification. Springer, 134–155

2024
[49]

Florian Tambon, Arghavan Moradi-Dakhel, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Giuliano Antoniol. 2025. Bugs in large language models generated code: An empirical study.Empirical Software Engineering30, 3 (2025), 65

2025
[50]

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri
[51]

CLEVER: A Curated Benchmark for Formally Verified Code Generation.Advances in Neural Information Processing Systems38 (2026)

2026
[52]

The HIJA Consortium. 2007. High-Integrity Java Applications (HIJA): A new programming standard for safety–critical embedded systems. European Commission IST project IST-2001-35256 final report. Project ran 2004–2007

2007
[53]

Ran Wei, Simon Foster, Haitao Mei, Fang Yan, Ruizhe Yang, Ibrahim Habli, Colin O’Halloran, Nick Tudor, Tim Kelly, and Yakoub Nemouchi. 2024. ACCESS: Assurance case centric engineering of safety–critical systems.Journal of Systems and Software213 (2024), 112034

2024
[54]

Ran Wei, Tim P Kelly, Xiaotian Dai, Shuai Zhao, and Richard Hawkins. 2019. Model Based System Assurance Using the Structured Assurance Case Metamodel.Journal of Systems and Software154 (2019), 211–233

2019
[55]

Jim Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. 2009. Formal Methods: Practice and Experience.Comput. Surveys41, 4 (2009), 19:1–19:36. doi:10.1145/1592434.1592436

work page doi:10.1145/1592434.1592436 2009
[56]

Awadallah, Ryen W

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. In COLM

2024
[57]

Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. 2025. Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification.arXiv preprint arXiv:2509.23061(2025)

arXiv 2025
[58]

Fang Yan, Simon Foster, and Ibrahim Habli. 2023. Automated Compositional Verification for Robotic State Machines Using Isabelle/HOL. InICECCS. IEEE, 167–176. Manuscript submitted to ACM

2023

[1] [1]

2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel

Martin Armbruster. 2022.Parsing and Printing Java 7–15 by Extending an Existing Metamodel. Technical Report. Karlsruhe Institute of Technology. doi:10.5445/IR/1000149186

work page doi:10.5445/ir/1000149186 2022

[2] [2]

Mike Barnett, Bor-Yuh Evan Chang, Robert DeLine, Bart Jacobs, and K Rustan M Leino. 2005. Boogie: A Modular Reusable Verifier for Object- Oriented Programs. InFormal Methods for Components and Objects (FMCO 2005) (Lecture Notes in Computer Science, Vol. 4111). Springer, 364–387. doi:10.1007/11804192_17

work page doi:10.1007/11804192_17 2005

[3] [3]

James Baxter, Pedro Ribeiro, and Ana Cavalcanti. 2022. Sound Reasoning in tock-CSP.Acta Informatica59 (2022), 125–162

2022

[4] [4]

Artur Boronat and Jawad Mustafa. 2025. MDRE-LLM: A Tool for Analyzing and Applying LLMs in Software Reverse Engineering. InSANER. IEEE, 850–854

2025

[5] [5]

Hugo Bruneliere, Jordi Cabot, Grégoire Dupé, and Frédéric Madiot. 2014. MoDisco: A Model Driven Reverse Engineering Framework.Information and Software Technology56, 8 (2014), 1012–1032

2014

[6] [6]

Victor Campanello, Shariq Shahbaz, Vladislav Indykov, and Daniel Strüber. 2025. On the use of GPT-4 in the reverse engineering of class diagrams. Journal of Object Technology24, 2 (2025), 1–14

2025

[7] [7]

Javier Luis Cánovas Izquierdo and Jesús García Molina. 2014. Extracting Models from Source Code in Software Modernization.Software and Systems Modeling13, 2 (2014), 713–734

2014

[8] [8]

Krzysztof Czarnecki and Simon Helsen. 2006. Feature-Based Survey of Model Transformation Approaches.IBM Systems Journal45, 3 (2006), 621–645. doi:10.1147/sj.453.0621

work page doi:10.1147/sj.453.0621 2006

[9] [9]

Rafael S Durelli, Daniel SM Santibánez, Bruno Marinho, Raphael Honda, Márcio E Delamaro, Nicolas Anquetil, and Valter Vieira de Camargo

[10] [10]

InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014)

A mapping study on architecture-driven modernization. InProceedings of the 2014 IEEE 15th international conference on information reuse and integration (IEEE IRI 2014). IEEE, 577–584

2014

[11] [11]

Eclipse Epsilon Project. 2024. Epsilon EMC JDT Driver: On-Demand Java Model Connectivity via Eclipse JDT. https://eclipse.dev/epsilon/doc/ articles/jdt/ Accessed: 2025-12-01

2024

[12] [12]

Eclipse Foundation. 2024. Acceleo: M2T Code Generation. https://eclipse.dev/acceleo/. Implementing the OMG MOF Model to Text Transformation Language (MOFM2T) standard

2024

[13] [13]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53

2023

[14] [14]

Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe. 2025. Vibe Coding in Practice: Motivations, Challenges, and a Future Outlook–a Grey Literature Review.arXiv preprint arXiv:2510.00328(2025)

arXiv 2025

[15] [15]

Simon Foster, Yakoub Nemouchi, Mario Gleirscher, Ran Wei, and Tim Kelly. 2021. Integration of Formal Proof into Unified Assurance Cases with Isabelle/SACM.Formal Aspects of Computing33, 6 (2021), 855–884

2021

[16] [16]

Simon Foster, Yakoub Nemouchi, Colin O’Halloran, Karen Stephenson, and Nick Tudor. 2020. Formal Model-Based Assurance Cases in Isabelle/SACM: An Autonomous Underwater Vehicle Case Study. InFormaliSE 2020. ACM, 11–21

2020

[17] [17]

Thomas Gibson-Robinson, Philip Armstrong, Alexandre Boulgakov, and Andrew W Roscoe. 2014. FDR3: A Modern Refinement Checker for CSP. In TACAS (LNCS, Vol. 8413). Springer, 187–201

2014

[18] [18]

Florian Heidenreich, Jendrik Johannes, Mirko Seifert, and Christian Wende. 2010. Closing the Gap between Modelling and Java. InSoftware Language Engineering (SLE 2009) (LNCS, Vol. 5969). Springer, 374–383. doi:10.1007/978-3-642-12107-4_25

work page doi:10.1007/978-3-642-12107-4_25 2010

[19] [19]

Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. 2023. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation.arXiv preprint arXiv:2312.13010(2023)

Pith/arXiv arXiv 2023

[20] [20]

2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems

IEC. 2010.IEC 61508: Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems. Technical Report. International Electrotechnical Commission

2010

[21] [21]

ISO. 2018. Road Vehicles – Functional Safety

2018

[22] [22]

Java Community Process. 2021. JSR-302: Safety-Critical Java Technology. Java Specification Request, public review draft. Started 2006; under JCP standardisation. https://jcp.org/en/jsr/detail?id=302

2021

[23] [23]

Andrej Karpathy. 2025. Vibe Coding. X (formerly Twitter). https://x.com/karpathy/status/1886192184808149383

arXiv 2025

[24] [24]

Tim Kelly. 2004. A Systematic Approach to Safety Case Management. InSAE 2004 World Congress & Exhibition. SAE Technical Paper

2004

[25] [25]

Dimitrios S Kolovos, Richard F Paige, and Fiona AC Polack. 2008. The Epsilon Transformation Language. InICMT 2008 (LNCS, Vol. 5063). Springer, 46–60

2008

[26] [26]

Jagun Kwon, Andy Wellings, and Steve King. 2005. Ravenscar-Java: A high-integrity profile for real-time Java.Concurrency and Computation: Practice and Experience17, 5–6 (2005), 681–713. doi:10.1002/cpe.843 Originally presented in 2002; journal version 2005

work page doi:10.1002/cpe.843 2005

[27] [27]

Andrea Lattuada, Travis Hance, Chanhee Cho, Matthias Brun, Isitha Subasinghe, Yi Zhou, Jon Howell, Bryan Parno, and Chris Hawblitzel. 2023. Verus: Verifying rust programs using linear ghost types.Proceedings of the ACM on Programming Languages7, OOPSLA1 (2023), 286–315

2023

[28] [28]

K Rustan M Leino. 2010. Dafny: An automatic program verifier for functional correctness. InInternational conference on logic for programming artificial intelligence and reasoning. Springer, 348–370. Manuscript submitted to ACM 22 Ran Wei, Le Zhu, Haochi Wang, Jim Woodcock, Fang Yan, Simon Foster, and Xiangyang Ji

2010

[29] [29]

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in neural information processing systems36 (2023), 21558–21572

2023

[30] [30]

Dennis, Clare Dixon, and Michael Fisher

Matt Luckcuck, Marie Farrell, Louise A. Dennis, Clare Dixon, and Michael Fisher. 2019. Formal Specification and Verification of Autonomous Robotic Systems: A Survey.Comput. Surveys52, 5 (2019), 100:1–100:41. doi:10.1145/3342355

work page doi:10.1145/3342355 2019

[31] [31]

Tom Mens and Pieter Van Gorp. 2006. A Taxonomy of Model Transformation.Electronic Notes in Theoretical Computer Science152 (2006), 125–142. doi:10.1016/j.entcs.2005.10.021

work page doi:10.1016/j.entcs.2005.10.021 2006

[32] [32]

MISRA. 2012. MISRA C:2012: Guidelines for the Use of the C Language in Critical Systems

2012

[33] [33]

Md Rakib Hossain Misu, Cristina V Lopes, Iris Ma, and James Noble. 2024. Towards ai-assisted synthesis of verified dafny methods.Proceedings of the ACM on Software Engineering1, FSE (2024), 812–835

2024

[34] [34]

Alvaro Miyazawa, Pedro Ribeiro, Wei Li, Ana Cavalcanti, Jon Timmis, and Jim Woodcock. 2019. RoboChart: Modelling and Verification of the Functional Behaviour of Robotic Applications.Software and Systems Modeling18, 5 (2019), 3097–3149

2019

[35] [35]

Leonardo de Moura and Sebastian Ullrich. 2021. The lean 4 theorem prover and programming language. InInternational Conference on Automated Deduction. Springer, 625–635

2021

[36] [36]

Prasita Mukherjee, Minghai Lu, and Benjamin Delaware. 2025. LLM-Assisted Synthesis of High-Assurance C Programs. In2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 3108

2025

[37] [37]

Paulson, and Markus Wenzel

Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. 2002.Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Lecture Notes in Computer Science, Vol. 2283. Springer. doi:10.1007/3-540-45949-9

work page doi:10.1007/3-540-45949-9 2002

[38] [38]

2015.XML Metadata Interchange (XMI) Specification

Object Management Group. 2015.XML Metadata Interchange (XMI) Specification. Technical Report Version 2.5.1. OMG. https://www.omg.org/spec/ XMI/

2015

[39] [39]

2020.Structured Assurance Case Metamodel Specification

Object Management Group. 2020.Structured Assurance Case Metamodel Specification. Technical Report. OMG. https://www.omg.org/spec/SACM/

2020

[40] [40]

Renaud Pawlak, Martin Monperrus, Nicolas Petitprez, Carlos Noguera, and Lionel Seinturier. 2015. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code.Software: Practice and Experience46, 9 (2015), 1155–1179. doi:10.1002/spe.2346

work page doi:10.1002/spe.2346 2015

[41] [41]

Veronica Pimenova, Sarah Fakhoury, Christian Bird, Margaret-Anne Storey, and Madeline Endres. 2025. Good Vibrations? A Qualitative Study of Co-Creation, Communication, Flow, and Trust in Vibe Coding. arXiv:2509.12491 [cs.SE] https://arxiv.org/abs/2509.12491

arXiv 2025

[42] [42]

Rose, Richard F

Louis M. Rose, Richard F. Paige, Dimitrios S. Kolovos, and Fiona A. C. Polack. 2008. The Epsilon Generation Language. InModel Driven Architecture– Foundations and Applications (ECMDA-FA 2008) (Lecture Notes in Computer Science, Vol. 5095). Springer, 1–16. doi:10.1007/978-3-540-69100-6_1

work page doi:10.1007/978-3-540-69100-6_1 2008

[43] [43]

2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification

RTCA. 2011.DO-178C: Software Considerations in Airborne Systems and Equipment Certification. Technical Report. RTCA Inc

2011

[44] [44]

2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A

RTCA. 2011.DO-332: Object-Oriented Technology and Related Techniques Supplement to DO-178C and DO-278A. RTCA, Inc. / EUROCAE (as ED-217), Washington, DC. Published December 2011

2011

[45] [45]

Shane Sendall and Wojtek Kozaczynski. 2003. Model Transformation: The Heart and Soul of Model-Driven Software Development.IEEE Software20, 5 (2003), 42–45. doi:10.1109/MS.2003.1231150

work page doi:10.1109/ms.2003.1231150 2003

[46] [46]

Hanan Siala and Kevin Lano. 2025. A comparison of large language models and model-driven reverse engineering for reverse engineering.Frontiers in Computer Science7 (2025), 1516410

2025

[47] [47]

2008.EMF: Eclipse Modeling Framework(2nd ed.)

Dave Steinberg, Frank Budinsky, Marcelo Paternostro, and Ed Merks. 2008.EMF: Eclipse Modeling Framework(2nd ed.). Addison-Wesley

2008

[48] [48]

Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. 2024. Clover: Closed-loop verifiable code generation. InInternational Symposium on AI Verification. Springer, 134–155

2024

[49] [49]

Florian Tambon, Arghavan Moradi-Dakhel, Amin Nikanjam, Foutse Khomh, Michel C Desmarais, and Giuliano Antoniol. 2025. Bugs in large language models generated code: An empirical study.Empirical Software Engineering30, 3 (2025), 65

2025

[50] [50]

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri

[51] [51]

CLEVER: A Curated Benchmark for Formally Verified Code Generation.Advances in Neural Information Processing Systems38 (2026)

2026

[52] [52]

The HIJA Consortium. 2007. High-Integrity Java Applications (HIJA): A new programming standard for safety–critical embedded systems. European Commission IST project IST-2001-35256 final report. Project ran 2004–2007

2007

[53] [53]

Ran Wei, Simon Foster, Haitao Mei, Fang Yan, Ruizhe Yang, Ibrahim Habli, Colin O’Halloran, Nick Tudor, Tim Kelly, and Yakoub Nemouchi. 2024. ACCESS: Assurance case centric engineering of safety–critical systems.Journal of Systems and Software213 (2024), 112034

2024

[54] [54]

Ran Wei, Tim P Kelly, Xiaotian Dai, Shuai Zhao, and Richard Hawkins. 2019. Model Based System Assurance Using the Structured Assurance Case Metamodel.Journal of Systems and Software154 (2019), 211–233

2019

[55] [55]

Jim Woodcock, Peter Gorm Larsen, Juan Bicarregui, and John Fitzgerald. 2009. Formal Methods: Practice and Experience.Comput. Surveys41, 4 (2009), 19:1–19:36. doi:10.1145/1592434.1592436

work page doi:10.1145/1592434.1592436 2009

[56] [56]

Awadallah, Ryen W

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed H. Awadallah, Ryen W. White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. In COLM

2024

[57] [57]

Xu Xu, Xin Li, Xingwei Qu, Jie Fu, and Binhang Yuan. 2025. Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification.arXiv preprint arXiv:2509.23061(2025)

arXiv 2025

[58] [58]

Fang Yan, Simon Foster, and Ibrahim Habli. 2023. Automated Compositional Verification for Robotic State Machines Using Isabelle/HOL. InICECCS. IEEE, 167–176. Manuscript submitted to ACM

2023