LLM4MTLs: Automated Generation and Empirical Evaluation of Model Transformation Languages
Pith reviewed 2026-06-25 22:37 UTC · model grok-4.3
The pith
Few-shot prompting improves syntactic quality of LLM-generated code across four model transformation languages, while semantic gains remain uneven and language-dependent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An automated workflow systematically explores prompt constructions that combine few-shot prompting, grammar prompting, and helper method inclusion, then evaluates them on four MTLs with syntactic and semantic metrics derived from executable reference scripts and manually written test suites. Few-shot prompting consistently improves syntactic quality across all four languages, while semantic correctness gains are uneven and language-dependent; for ATL, Pass@1 remains unchanged across strategies and models. Grammar prompting stabilizes generation when combined with few-shot examples but can be ineffective or counterproductive alone. LLM choice influences syntactic correctness and similarity for certain MTLs, particularly ETL and QVTo, while its influence on semantic correctness remains limited.
What carries the argument
The LLM4MTLs workflow that constructs prompt variants and scores them with syntactic and semantic metrics on reference scripts and test suites for ATL, ETL, QVTo, and Reactions.
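One way to picture the prompt-variant construction is as an enumeration over the three techniques, each switched on or off. The sketch below is only an illustration under that assumption: the component names, the section labels, and the ordering of sections inside the prompt are hypothetical and are not taken from the paper.

```python
from itertools import product

# Hypothetical component names; the paper names the three techniques
# but does not say how the workflow represents them internally.
COMPONENTS = ("few_shot", "grammar", "helpers")


def prompt_variants():
    """Return every on/off combination of the three prompt components."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((False, True), repeat=len(COMPONENTS))
    ]


def build_prompt(task, variant, examples="", grammar="", helpers=""):
    """Assemble a prompt from the sections enabled in `variant`.

    The section order (grammar, helpers, examples, task) is an assumption.
    """
    sections = []
    if variant["grammar"]:
        sections.append("Grammar:\n" + grammar)
    if variant["helpers"]:
        sections.append("Helper methods:\n" + helpers)
    if variant["few_shot"]:
        sections.append("Examples:\n" + examples)
    sections.append("Task:\n" + task)
    return "\n\n".join(sections)
```

Enumerating all combinations gives eight variants per task, including a plain baseline with no added sections, which makes it possible to compare a technique in isolation against the same technique combined with others.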
If this is right
- Few-shot prompting can be applied to raise syntactic validity when generating code in ATL, ETL, QVTo, or Reactions.
- Grammar prompting produces more stable output only when paired with few-shot examples rather than used alone.
- Helper method inclusion can further improve results for certain language-model pairs.
- LLM selection affects syntactic metrics more noticeably for ETL and QVTo than for semantic metrics across the board.
- ATL shows no measurable semantic improvement from any of the tested prompting strategies.
Where Pith is reading between the lines
- The same workflow could be applied to other domain-specific languages whose training data is similarly sparse.
- Teams facing MTL development might first adopt few-shot prompting for quick syntactic fixes before investing in grammar engineering.
- Semantic correctness may ultimately require methods beyond prompting, such as retrieval of transformation patterns or targeted fine-tuning.
- The language-dependent semantic results point to differences in how each MTL's structure interacts with LLM token prediction.
- The evaluation suite itself becomes a shared benchmark that later studies can extend with additional languages or metrics.
Load-bearing premise
The manually written test suites used to measure semantic correctness are representative and free of bias in coverage or construction.
What would settle it
Re-running the identical evaluation on a fresh collection of test cases that deliberately includes more edge cases and transformations absent from the original suites, then observing whether the reported syntactic gains persist while semantic gains remain language-dependent.
Original abstract
Model transformation languages (MTLs) are domain-specific languages for transforming models conforming to a given metamodel into other models, including textual models such as source code. Developing correct model transformations is challenging, requiring both language-specific and domain knowledge, and motivating the use of large language models (LLMs) for MTL code generation. However, due to limited training data and executable examples, LLM-generated MTL code is often not syntactically valid or semantically usable out of the box. This paper presents LLM4MTLs, an automated workflow for constructing and comparing prompting strategies for LLM-generated MTL code, together with an evaluation suite and an empirical evaluation. The workflow systematically explores prompt constructions combining few-shot prompting, grammar prompting, and helper method inclusion, and evaluates them using syntactic and semantic metrics. We construct an evaluation suite spanning four MTLs (ATL, ETL, QVTo, and the Reactions language) with executable reference scripts and manually written test suites, and evaluate across three LLMs. We find that few-shot prompting consistently improves syntactic quality across all four MTLs while gains in semantic correctness are uneven and language-dependent. For ATL, Pass@1 remains unchanged across all strategies and models, indicating that few-shot prompting improves surface-level syntax more readily than deep transformation semantics. Grammar prompting stabilizes code generation when combined with few-shot examples, but in isolation it can be ineffective or even counterproductive for certain model-language combinations. Including helper methods as a complementary amplifier can also be beneficial. Finally, LLM choice influences syntactic correctness and similarity for certain MTLs, particularly ETL and QVTo, while its influence on semantic correctness remains limited.
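For readers unfamiliar with Pass@1, the following sketch shows the commonly used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k), which reduces to c / n for k = 1. It is an assumption that the paper computes Pass@1 this way; the abstract does not state the number of samples per task or the exact estimator, and the function names here are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for one task
    c: samples that pass every test in the suite
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results):
    """Average pass@1 over tasks; `results` holds one (n, c) pair per task."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

Under this definition, an unchanged Pass@1 for ATL means the share of generated scripts passing their full test suite did not move, regardless of how much closer the output came to valid syntax.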
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LLM4MTLs, an automated workflow for exploring and comparing prompting strategies (few-shot, grammar prompting, helper methods) for LLM generation of code in four model transformation languages (ATL, ETL, QVTo, Reactions). It evaluates three LLMs using syntactic metrics (parser success, similarity) and semantic metrics (Pass@1 on executable reference scripts paired with manually written test suites), reporting that few-shot prompting yields consistent syntactic gains across all MTLs while semantic gains are uneven and language-dependent, with ATL showing no Pass@1 improvement across any strategy or model.
Significance. If the results hold, the work supplies empirical evidence on the relative effectiveness of prompting techniques for an under-served domain (MTL code generation) where training data is scarce. The systematic multi-language, multi-LLM design and dual syntactic/semantic evaluation are strengths; the finding that syntactic improvements do not reliably produce semantic gains is a practically useful observation for model-driven engineering. The automated workflow itself supports reproducibility.
major comments (1)
- [Evaluation suite] Evaluation suite description: The paper provides no information on the construction, size, coverage criteria (metamodel element coverage, rule interactions, edge cases), or validation of the manually written test suites used to compute semantic correctness (Pass@1). Because the central claims about uneven, language-dependent semantic gains and the ATL null result rest exclusively on these suites, the absence of such details leaves open the possibility that reported differences are artifacts of test-suite bias rather than properties of the prompting strategies. This is load-bearing for the semantic half of the empirical contribution.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights an important gap in the presentation of our evaluation methodology. We address the major comment below and commit to a revision that strengthens the semantic evaluation section.
Point-by-point responses
Referee: [Evaluation suite] Evaluation suite description: The paper provides no information on the construction, size, coverage criteria (metamodel element coverage, rule interactions, edge cases), or validation of the manually written test suites used to compute semantic correctness (Pass@1). Because the central claims about uneven, language-dependent semantic gains and the ATL null result rest exclusively on these suites, the absence of such details leaves open the possibility that reported differences are artifacts of test-suite bias rather than properties of the prompting strategies. This is load-bearing for the semantic half of the empirical contribution.
Authors: We agree that the current manuscript lacks sufficient detail on test-suite construction, which is necessary to substantiate the semantic results. In the revised version we will add a dedicated subsection (approximately 4.2) describing: (1) the size of each suite (number of test cases per MTL), (2) coverage criteria explicitly including metamodel element coverage, rule-interaction coverage, and selected edge cases, (3) the process used to validate the suites against the executable reference scripts, and (4) steps taken to reduce bias (e.g., independent review of test cases). These additions will allow readers to assess whether the reported language-dependent semantic differences reflect prompting strategy properties rather than suite artifacts. revision: yes
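To make the promised coverage criterion concrete, metamodel element coverage could be reported as the fraction of metamodel elements exercised by at least one test case. The sketch below is a hypothetical illustration of such a report, not the authors' procedure; the function name, the input shapes, and the element names in the usage note are all assumptions.

```python
def element_coverage(metamodel_elements, test_cases):
    """Fraction of metamodel elements touched by at least one test case.

    metamodel_elements: iterable of element names (hypothetical labels).
    test_cases: mapping of test name -> set of element names it exercises.
    Returns (coverage ratio, sorted list of uncovered elements).
    """
    required = set(metamodel_elements)
    if not required:
        raise ValueError("metamodel has no elements to cover")
    # Union of everything any test exercises; empty if there are no tests.
    touched = set().union(*test_cases.values())
    covered = required & touched
    return len(covered) / len(required), sorted(required - covered)
```

For example, with elements `A`, `B`, `C`, `D` and two tests exercising `{A, B}` and `{B, C}`, the function reports a ratio of 0.75 and lists `D` as uncovered. Publishing such figures per MTL would let readers judge whether a suite is too thin to detect semantic differences.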
Circularity Check
No circularity: purely empirical evaluation with no derivations or fitted predictions
full rationale
The paper reports an empirical study comparing prompting strategies for LLM-generated MTL code across four languages and three models. It defines syntactic metrics (parser success, similarity) and semantic metrics (Pass@1 via manually written test suites with reference scripts) and directly measures outcomes. No equations, parameter fitting presented as prediction, self-citation chains, or ansatzes are used to derive results; all claims rest on explicit experimental runs. The evaluation suite construction is described at a high level but does not reduce any reported finding to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Manually written test suites provide an unbiased and sufficiently complete measure of semantic correctness for the chosen transformations.
- Domain assumption: Syntactic validity and Pass@1 on the provided tests are adequate proxies for practical usability of generated MTL code.