LLM4MTLs: Automated Generation and Empirical Evaluation of Model Transformation Languages

Anne Koziolek; Arne Lange; Benedikt Jutz; Bowen Jiang; Haowei Cheng; Nathan Hagel; Rahul Sharma; Ralf Reussner; Weixing Zhang

arxiv: 2606.25193 · v1 · pith:HAQMYLI5new · submitted 2026-06-23 · 💻 cs.SE


Pith reviewed 2026-06-25 22:37 UTC · model grok-4.3

classification 💻 cs.SE

keywords: model transformation languages, large language models, prompt engineering, ATL, ETL, QVTo, syntactic correctness, semantic correctness

The pith

Few-shot prompting improves syntactic quality of LLM-generated code across four model transformation languages, while semantic gains remain uneven and language-dependent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM4MTLs, an automated workflow that builds and compares prompting strategies for generating code in model transformation languages with large language models. It combines few-shot examples, grammar prompts, and helper methods, then measures results on four languages using syntactic validity and semantic correctness against executable references and test suites. Few-shot prompting raises syntactic scores reliably for ATL, ETL, QVTo, and Reactions, yet semantic correctness improves only in some languages and stays flat for ATL. The work supplies a reusable evaluation suite so that future prompting experiments can be compared on the same basis. This matters for anyone who must produce correct model transformations, a task that normally demands both language expertise and domain knowledge.

Core claim

An automated workflow systematically explores prompt constructions that combine few-shot prompting, grammar prompting, and helper method inclusion, then evaluates them on four MTLs with syntactic and semantic metrics derived from executable reference scripts and manually written test suites. Few-shot prompting consistently improves syntactic quality across all four languages while semantic correctness gains are uneven and language-dependent; for ATL, Pass@1 remains unchanged across strategies and models. Grammar prompting stabilizes generation when combined with few-shot examples but can be ineffective or counterproductive alone. LLM choice influences syntactic correctness and similarity for certain MTLs, particularly ETL and QVTo, while its influence on semantic correctness remains limited.

What carries the argument

The LLM4MTLs workflow that constructs prompt variants and scores them with syntactic and semantic metrics on reference scripts and test suites for ATL, ETL, QVTo, and Reactions.
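To make the idea of "constructing prompt variants" concrete, here is a minimal sketch of how the three components could be toggled and assembled. The component names, the section layout of the prompt, and the `build_prompt` helper are assumptions for illustration; the page does not show the paper's actual templates, nor does it say whether all eight combinations were evaluated.

```python
from itertools import product

# Hypothetical component names; the paper's real templates are not shown here.
COMPONENTS = ("few_shot", "grammar", "helpers")


def enumerate_strategies():
    """Return every on/off combination of the three prompt components."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((False, True), repeat=len(COMPONENTS))
    ]


def build_prompt(task, strategy, examples="", grammar="", helpers=""):
    """Assemble a prompt from the enabled components (illustrative layout only)."""
    parts = []
    if strategy["grammar"]:
        parts.append("Grammar:\n" + grammar)
    if strategy["helpers"]:
        parts.append("Helper methods:\n" + helpers)
    if strategy["few_shot"]:
        parts.append("Examples:\n" + examples)
    parts.append("Task:\n" + task)
    return "\n\n".join(parts)


strategies = enumerate_strategies()
```

Each entry in `strategies` can then be scored with the same metrics, which is what allows the variants to be compared on an equal footing.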

If this is right

Few-shot prompting can be applied to raise syntactic validity when generating code in ATL, ETL, QVTo, or Reactions.
Grammar prompting produces more stable output only when paired with few-shot examples rather than used alone.
Helper method inclusion can further improve results for certain language-model pairs.
LLM selection affects syntactic correctness and similarity, most noticeably for ETL and QVTo, while its effect on semantic correctness is limited.
ATL shows no change in Pass@1 under any of the tested prompting strategies or models.
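For readers unfamiliar with the Pass@1 figure used for semantic correctness, the sketch below shows the unbiased pass@k estimator that is widely used in code generation benchmarks. That the paper uses this exact estimator, and how many samples it draws per task, are assumptions; the page does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: generated samples per task, c: samples passing all tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results):
    """Average Pass@1 over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of generated scripts that pass every test, so an unchanged Pass@1 means the share of fully passing scripts did not move.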

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same workflow could be applied to other domain-specific languages whose training data is similarly sparse.
Teams facing MTL development might first adopt few-shot prompting for quick syntactic fixes before investing in grammar engineering.
Semantic correctness may ultimately require methods beyond prompting, such as retrieval of transformation patterns or targeted fine-tuning.
The language-dependent semantic results point to differences in how each MTL's structure interacts with LLM token prediction.
The evaluation suite itself becomes a shared benchmark that later studies can extend with additional languages or metrics.

Load-bearing premise

The manually written test suites used to measure semantic correctness are representative and free of bias in coverage or construction.

What would settle it

Re-running the identical evaluation on a fresh collection of test cases that deliberately includes more edge cases and transformations absent from the original suites, then observing whether the reported syntactic gains persist while semantic gains remain language-dependent.

Original abstract

Model transformation languages (MTLs) are domain-specific languages for transforming models conforming to a given metamodel into other models, including textual models such as source code. Developing correct model transformations is challenging, requiring both language-specific and domain knowledge, and motivating the use of large language models (LLMs) for MTL code generation. However, due to limited training data and executable examples, LLM-generated MTL code is often not syntactically valid or semantically usable out of the box. This paper presents LLM4MTLs, an automated workflow for constructing and comparing prompting strategies for LLM-generated MTL code, together with an evaluation suite and an empirical evaluation. The workflow systematically explores prompt constructions combining few-shot prompting, grammar prompting, and helper method inclusion, and evaluates them using syntactic and semantic metrics. We construct an evaluation suite spanning four MTLs (ATL, ETL, QVTo, and the Reactions language) with executable reference scripts and manually written test suites, and evaluate across three LLMs. We find that few-shot prompting consistently improves syntactic quality across all four MTLs while gains in semantic correctness are uneven and language-dependent. For ATL, Pass@1 remains unchanged across all strategies and models, indicating that few-shot prompting improves surface-level syntax more readily than deep transformation semantics. Grammar prompting stabilizes code generation when combined with few-shot examples, but in isolation it can be ineffective or even counterproductive for certain model-language combinations. Including helper methods as a complementary amplifier can also be beneficial. Finally, LLM choice influences syntactic correctness and similarity for certain MTLs, particularly ETL and QVTo, while its influence on semantic correctness remains limited.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a practical empirical comparison of prompting strategies for LLM-generated MTL code across four languages, but the semantic claims rest on test suites with no reported construction details.

This paper's core contribution is an empirical head-to-head on prompting techniques for generating code in four model transformation languages using LLMs. They show few-shot examples lift syntactic quality across the board, while semantic correctness gains are patchier and depend on the language, with ATL showing no improvement in Pass@1 no matter the strategy.

The new pieces are the automated workflow that mixes few-shot, grammar, and helper methods, plus the evaluation suite that includes executable reference transformations and manually written test suites for four MTLs. Running this on three different LLMs gives a decent picture of what works for syntax and where semantics lag.

The syntactic results look solid because they rely on parser checks and similarity measures that can be computed directly. The language-specific findings on semantics are the part that could be interesting to people building tools in this area.

The main weakness is the lack of information on those test suites. No word on how many test cases per language, what metamodel elements they cover, or any process used to build or validate them. Without that, it's hard to know if the uneven semantic results reflect real differences in how prompting works or just how the tests were chosen. The ATL result in particular might shift if the suite was expanded.

This kind of targeted empirical work fits readers who care about practical LLM use in software engineering, especially model-driven approaches. It doesn't claim big theoretical advances but supplies data that could guide prompt design.

I would send this to peer review. The setup is clear enough that referees can ask for the missing test-suite details and check whether the conclusions hold.

Referee Report

1 major / 0 minor

Summary. The paper presents LLM4MTLs, an automated workflow for exploring and comparing prompting strategies (few-shot, grammar prompting, helper methods) for LLM generation of code in four model transformation languages (ATL, ETL, QVTo, Reactions). It evaluates three LLMs using syntactic metrics (parser success, similarity) and semantic metrics (Pass@1 on executable reference scripts paired with manually written test suites), reporting that few-shot prompting yields consistent syntactic gains across all MTLs while semantic gains are uneven and language-dependent, with ATL showing no Pass@1 improvement across any strategy or model.

Significance. If the results hold, the work supplies empirical evidence on the relative effectiveness of prompting techniques for an under-served domain (MTL code generation) where training data is scarce. The systematic multi-language, multi-LLM design and dual syntactic/semantic evaluation are strengths; the finding that syntactic improvements do not reliably produce semantic gains is a practically useful observation for model-driven engineering. The automated workflow itself supports reproducibility.

major comments (1)

[Evaluation suite] Evaluation suite description: The paper provides no information on the construction, size, coverage criteria (metamodel element coverage, rule interactions, edge cases), or validation of the manually written test suites used to compute semantic correctness (Pass@1). Because the central claims about uneven, language-dependent semantic gains and the ATL null result rest exclusively on these suites, the absence of such details leaves open the possibility that reported differences are artifacts of test-suite bias rather than properties of the prompting strategies. This is load-bearing for the semantic half of the empirical contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important gap in the presentation of our evaluation methodology. We address the major comment below and commit to a revision that strengthens the semantic evaluation section.

Referee: [Evaluation suite] Evaluation suite description: The paper provides no information on the construction, size, coverage criteria (metamodel element coverage, rule interactions, edge cases), or validation of the manually written test suites used to compute semantic correctness (Pass@1). Because the central claims about uneven, language-dependent semantic gains and the ATL null result rest exclusively on these suites, the absence of such details leaves open the possibility that reported differences are artifacts of test-suite bias rather than properties of the prompting strategies. This is load-bearing for the semantic half of the empirical contribution.

Authors: We agree that the current manuscript lacks sufficient detail on test-suite construction, which is necessary to substantiate the semantic results. In the revised version we will add a dedicated subsection (approximately 4.2) describing: (1) the size of each suite (number of test cases per MTL), (2) coverage criteria explicitly including metamodel element coverage, rule-interaction coverage, and selected edge cases, (3) the process used to validate the suites against the executable reference scripts, and (4) steps taken to reduce bias (e.g., independent review of test cases). These additions will allow readers to assess whether the reported language-dependent semantic differences reflect prompting strategy properties rather than suite artifacts.

Revision: yes
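As an illustration of what reporting suite size and metamodel element coverage could look like, here is a small sketch. The data layout (a mapping from test name to the set of metamodel elements it exercises) and the element names in the usage note are hypothetical and are not taken from the paper.

```python
def element_coverage(metamodel_elements, test_cases):
    """Summarize suite size and metamodel element coverage.

    metamodel_elements: iterable of element names in the metamodel.
    test_cases: dict mapping test name -> set of element names it exercises.
    """
    all_elements = set(metamodel_elements)
    if not all_elements:
        raise ValueError("metamodel has no elements")
    # Union of everything the tests touch, restricted to known elements.
    covered = set().union(*test_cases.values()) & all_elements
    return {
        "size": len(test_cases),
        "coverage": len(covered) / len(all_elements),
        "uncovered": sorted(all_elements - covered),
    }
```

Calling `element_coverage(["Family", "Member", "Person"], {"t1": {"Family"}, "t2": {"Family", "Person"}})` reports two test cases, two of three elements covered, and `Member` as uncovered. A per-language table of such figures would let readers judge whether a flat Pass@1 could be a coverage artifact.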

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or fitted predictions

The paper reports an empirical study comparing prompting strategies for LLM-generated MTL code across four languages and three models. It defines syntactic metrics (parser success, similarity) and semantic metrics (Pass@1 via manually written test suites with reference scripts) and directly measures outcomes. No equations, parameter fitting presented as prediction, self-citation chains, or ansatzes are used to derive results; all claims rest on explicit experimental runs. The evaluation suite construction is described at a high level but does not reduce any reported finding to its own inputs by construction.
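The similarity metric mentioned above is not specified on this page. One common family of measures compares character n-grams between a generated script and its reference; the sketch below is a simplified single-order character n-gram F-score, given as an assumption about what such a measure might look like rather than as the paper's metric or the official chrF implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_fscore(candidate, reference, n=3, beta=2.0):
    """F-beta score over character n-grams (simplified, single n)."""
    cand = char_ngrams(candidate, n)
    ref = char_ngrams(reference, n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A score like this rewards surface overlap with the reference, which is why it can rise under few-shot prompting even when the generated transformation still fails its tests.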

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of executable reference transformations and manually authored test suites whose semantic coverage is taken as given; no free parameters are fitted, no new entities are postulated, and the axioms are standard assumptions about LLM behavior and test-suite validity.

axioms (2)

Domain assumption: Manually written test suites provide an unbiased and sufficiently complete measure of semantic correctness for the chosen transformations.
Invoked when the abstract reports semantic metrics without describing suite validation or coverage analysis.

Domain assumption: Syntactic validity and Pass@1 on the provided tests are adequate proxies for practical usability of generated MTL code.
Underlies the decision to report syntactic quality and Pass@1 as primary outcomes.

pith-pipeline@v0.9.1-grok · 5852 in / 1408 out tokens · 21667 ms · 2026-06-25T22:37:06.487375+00:00 · methodology



2024

[48] [49]

International conference on model driven engineering languages and systems , pages=

UML2Alloy: A challenging model transformation , author=. International conference on model driven engineering languages and systems , pages=. 2007 , organization=

2007

[49] [50]

gradient descent

Automatic prompt optimization with "gradient descent" and beam search , author=. arXiv preprint arXiv:2305.03495 , doi =. 2023 , month =

arXiv 2023

[50] [51]

arXiv preprint arXiv:2311.05661 , year=

Prompt engineering a prompt engineer , author=. arXiv preprint arXiv:2311.05661 , year=

arXiv

[51] [52]

2018 , note =

Eclipse Foundation , title =. 2018 , note =

2018

[52] [53]

Advances in neural information processing systems , volume=

Language models are few-shot learners , author=. Advances in neural information processing systems , volume=

[53] [54]

Vitruv-CaseStudies , year =

[54] [55]

ATL Zoo Benchmark , year =

[55] [56]

Eclipse Epsilon , year =

[56] [57]

org.eclipse.qvto , year =

[57] [58]

ACM Transactions on Software Engineering and Methodology , year=

A survey on llm-based code generation for low-resource and domain-specific programming languages , author=. ACM Transactions on Software Engineering and Methodology , year=

[58] [59]

Advances in Neural Information Processing Systems , volume=

Grammar prompting for domain-specific language generation with large language models , author=. Advances in Neural Information Processing Systems , volume=

[59] [60]

General-purpose Languages: A Historical Perspective on ATL vs

Dedicated Model Transformation Languages vs. General-purpose Languages: A Historical Perspective on ATL vs. Java. , author=. MODELSWARD , pages=

[60] [61]

doi:10.5445/IR/1000193410 , url =

Large Language Models in Model-Driven Engineering: A Systematic Mapping Study , author =. doi:10.5445/IR/1000193410 , url =

work page doi:10.5445/ir/1000193410

[61] [62]

CoRR , volume =

Domenico Amalfitano and Andreas Metzger and Marco Autili and Tommaso Fulcini and Tobias Hey and Jan Keim and Patrizio Pelliccione and Vincenzo Scotti and Anne Koziolek and Raffaela Mirandola and Andreas Vogelsang , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.26275 , eprinttype =. 2510.26275 , timestamp =

work page doi:10.48550/arxiv.2510.26275 2025

[62] [63]

CoRR , volume =

Yusei Ishimizu and Takuto Yamauchi and Sinan Chen and Jinyu Cai and Jialong Li and Kenji Tei , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2512.07261 , eprinttype =. 2512.07261 , timestamp =

work page doi:10.48550/arxiv.2512.07261 2025

[63] [64]

Towards Efficient Discrete Controller Synthesis: Semantics-Aware Stepwise Policy Design via LLM , year=

Ishimizu, Yusei and Li, Jialong and Yamauchi, Takuto and Chen, Sinan and Cai, Jinyu and Hirano, Takanori and Tei, Kenji , booktitle=. Towards Efficient Discrete Controller Synthesis: Semantics-Aware Stepwise Policy Design via LLM , year=

[64] [65]

2024 , url =

Jialong Li and Mingyue Zhang and Nianyu Li and Danny Weyns and Zhi Jin and Kenji Tei , title =. 2024 , url =. doi:10.1145/3686803 , timestamp =

work page doi:10.1145/3686803 2024

[65] [66]

Exploring the Potential of Large Language Models in Self-adaptive Systems , booktitle =

Jialong Li and Mingyue Zhang and Nianyu Li and Danny Weyns and Zhi Jin and Kenji Tei , editor =. Exploring the Potential of Large Language Models in Self-adaptive Systems , booktitle =. 2024 , url =. doi:10.1145/3643915.3644088 , timestamp =

work page doi:10.1145/3643915.3644088 2024