Leveraging LLMs for Grammar Adaptation: A Study on Metamodel-Grammar Co-Evolution
Pith reviewed 2026-05-21 04:20 UTC · model grok-4.3
The pith
LLMs learn grammar adaptations from prior metamodel-grammar pairs to automatically update grammars after new evolutions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models can be prompted to learn grammar adaptation patterns from earlier metamodel-grammar pairs and then apply those patterns to update grammars after new metamodel evolutions. The evaluation covered six real-world Xtext domain-specific languages, four used to develop the prompts and two held out for testing. On the two held-out DSLs, all three tested LLMs reached 100% adaptation consistency and output similarity. In a longitudinal study of the QVTo language across three evolution steps, the LLM approach carried forward the learned adaptations without any manual grammar editing, whereas the rule-based baseline needed manual adjustments for two of the three transitions.
What carries the argument
Prompting strategies that supply LLMs with examples of past grammar adaptations from metamodel-grammar version pairs so the models can infer and generate updates for a new evolution step.
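As a rough illustration of what such a prompt might look like, the sketch below assembles a few-shot prompt from past rule adaptations. The paper's actual templates are not reproduced on this page, so the class name `AdaptationExample`, the function `build_prompt`, the prompt wording, and the sample rules are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AdaptationExample:
    """One past adaptation (hypothetical structure): a rule before and after editing."""
    rule_before: str
    rule_after: str


def build_prompt(examples, new_grammar):
    """Assemble a few-shot prompt: past adaptations first, then the grammar to update."""
    parts = ["You adapt Xtext grammar rules. Learn from these past adaptations:"]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nBefore:\n{ex.rule_before}\nAfter:\n{ex.rule_after}"
        )
    parts.append(
        "Apply the same adaptations to this grammar:\n" + new_grammar
    )
    return "\n\n".join(parts)


# Illustrative rules only; they are not taken from the evaluated DSLs.
examples = [
    AdaptationExample("Node: 'Node' name=ID;", "Node: 'node' name=ID;"),
    AdaptationExample("Edge: 'Edge' name=ID;", "Edge: 'edge' name=ID;"),
]
prompt = build_prompt(examples, "Graph: 'Graph' name=ID;")
```

The resulting string would then be sent to whichever model is under test; how the paper orders or words its examples may differ.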
If this is right
- Grammars stay consistent with evolved metamodels through automated application of learned adaptations.
- Adaptations learned in one step transfer and apply across later evolution steps in the same language without re-work.
- Complex grammar scenarios become manageable where rule-based methods achieve only partial success.
- Metamodel conformance is preserved in the LLM-generated grammar updates.
- Adaptation consistency falls well below 90% for grammars with hundreds of rules, as seen with EAST-ADL (297 rules).
Where Pith is reading between the lines
- Similar prompting could support co-evolution of other MDE artifacts such as transformations or OCL constraints when metamodels change.
- A hybrid workflow might use rules for routine cases and LLMs for the complex changes where they currently excel.
- Long-lived DSL projects could reduce ongoing maintenance costs if the learned adaptations prove stable over many versions.
- The observed size limitation suggests testing modular or chunked prompting to extend the method to larger grammars.
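The chunked-prompting idea in the last bullet could start from something like the sketch below, which splits a grammar body into rules and groups them into fixed-size chunks that would each be prompted separately. This is an assumption about how chunking might be done, not a method from the paper. The splitter is deliberately naive: it treats `;` outside quotes as a rule terminator and ignores comments and escaped quotes.

```python
def split_rules(grammar_body):
    """Split a grammar body into rules at ';' characters that are outside quotes.

    Naive by design: comments and escaped quotes are not handled.
    """
    rules, current, quote = [], [], None
    for ch in grammar_body:
        current.append(ch)
        if quote:
            if ch == quote:          # closing quote of a keyword literal
                quote = None
        elif ch in ("'", '"'):       # opening quote of a keyword literal
            quote = ch
        elif ch == ";":              # rule terminator outside any literal
            rule = "".join(current).strip()
            if rule:
                rules.append(rule)
            current = []
    tail = "".join(current).strip()
    if tail:
        rules.append(tail)
    return rules


def chunk_rules(rules, max_rules):
    """Group rules into consecutive chunks of at most max_rules each."""
    if max_rules < 1:
        raise ValueError("max_rules must be at least 1")
    return [rules[i:i + max_rules] for i in range(0, len(rules), max_rules)]


body = "Model: items+=Item*;\nItem: 'item' name=ID ';';\nName: ID;"
rules = split_rules(body)
chunks = chunk_rules(rules, 2)
```

With a chunk size of 50, a 297-rule grammar would yield six chunks. Rules that reference each other across chunks would still need shared context, which this sketch does not address.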
Load-bearing premise
Prompting strategies developed on the four training DSLs will transfer to unseen test DSLs and to future evolution steps of QVTo without requiring per-DSL manual prompt engineering or additional fine-tuning.
What would settle it
Applying the same prompts to a new unseen DSL evolution or the next QVTo step and measuring whether adaptation consistency remains at 100% or drops sharply without any prompt changes.
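To make the measurement concrete, here is a minimal sketch of a rule-level consistency score. The paper's exact definition is not given on this page, so the sketch assumes the score is the percentage of reference rules that the candidate grammar reproduces exactly after whitespace normalization. The function names and sample rules are hypothetical.

```python
def normalize(rule):
    """Collapse all whitespace runs so formatting differences do not count."""
    return " ".join(rule.split())


def adaptation_consistency(reference, candidate):
    """Percentage of reference rules matched by the candidate grammar.

    Both arguments map rule names to rule bodies. Assumed definition:
    a rule matches when the normalized bodies are identical.
    """
    if not reference:
        raise ValueError("reference grammar has no rules")
    matched = sum(
        1
        for name, body in reference.items()
        if name in candidate and normalize(candidate[name]) == normalize(body)
    )
    return 100.0 * matched / len(reference)


reference = {
    "Graph": "'graph' name=ID nodes+=Node*",
    "Node": "'node' name=ID",
    "Edge": "source=ID '->' target=ID",
    "Attr": "key=ID '=' value=STRING",
}
candidate = {
    "Graph": "'graph'  name=ID   nodes+=Node*",  # same rule, different spacing
    "Node": "'node' name=ID",
    "Edge": "source=ID '--' target=ID",          # differs from the reference
    "Attr": "key=ID '=' value=STRING",
}
score = adaptation_consistency(reference, candidate)
```

Under this definition a score such as 84.21% corresponds to a ratio like 16 matching rules out of 19; those counts are purely illustrative and are not reported in the text above.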
Original abstract
In model-driven engineering, metamodel evolution leads to the need to adapt corresponding grammars to maintain consistency, which typically requires tedious manual work. Existing rule-based methods can achieve partial automation but have limitations when handling complex grammar scenarios. This paper proposes a Large Language Model-based approach that automatically applies adaptations to new grammars after evolution by learning grammar adaptations from previous versions. We evaluated this approach on six real-world Xtext domain-specific languages, using four DSLs as a training set to develop prompting strategies, two DSLs as a test set for validation, and conducting a longitudinal case study on QVTo. The evaluation used three Large Language Models (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) and measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance. Results show that on the test set, all three LLMs achieved 100% adaptation consistency and output similarity, while the rule-based approach achieved only 84.21% on DOT and 62.50% on Xcore. In the QVTo longitudinal study, the LLM-based approach successfully reused learned adaptations across all three evolution steps without manual grammar editing, while the rule-based approach required manual adjustments in two of three transitions. However, on large-scale grammars (EAST-ADL, 297 rules), LLMs' adaptation consistency was far below 90%. This study demonstrates the advantages of LLM-based approaches in handling complex grammar scenarios, while revealing their limitations in large-scale grammar adaptation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an LLM-based method for automatically adapting Xtext grammars to metamodel evolutions in model-driven engineering. It trains prompting strategies on four DSLs, evaluates on two held-out test DSLs (DOT, Xcore) and a longitudinal QVTo case study across three evolution steps, and compares three LLMs (Claude Sonnet 4.5, ChatGPT 5.1, Gemini 3) against a rule-based baseline using metrics of rule-level adaptation consistency, output similarity, and metamodel conformance. The abstract reports 100% consistency and similarity for all LLMs on the test set versus 84.21% and 62.50% for the rule-based method on DOT and Xcore, successful reuse without manual edits in QVTo, but sub-90% consistency on the large EAST-ADL grammar (297 rules).
Significance. If the generalization claim holds, the work provides concrete evidence that LLMs can reduce manual grammar editing effort in MDE co-evolution scenarios more effectively than rule-based baselines, with reproducible numeric results across multiple models and a longitudinal study showing cross-step reuse. This could influence tool support for grammar maintenance in Xtext-based DSLs, though the reported scalability limit on large grammars qualifies the practical impact.
major comments (2)
- [Evaluation (test set)] Evaluation section (test-set results): The headline claim of 100% adaptation consistency and output similarity on DOT and Xcore depends on the prompting strategies developed from the four training DSLs transferring without per-DSL adjustments. The manuscript provides no explicit prompt templates, example-selection procedure, or confirmation that the same prompts were used unchanged on the held-out test DSLs; without this, the performance gap versus the rule-based baseline cannot be attributed solely to LLM generalization rather than implicit tuning.
- [QVTo case study] QVTo longitudinal study: The claim that the LLM approach reused learned adaptations across all three evolution steps without manual grammar editing is central to demonstrating practical utility. However, the description lacks detail on the exact mechanism for carrying adaptations forward (e.g., whether prior outputs were appended to prompts or how metamodel changes were encoded), making it difficult to assess whether the result reflects robust co-evolution or case-specific prompting.
minor comments (2)
- [Abstract / Evaluation setup] The abstract and evaluation should include a brief table summarizing the six DSLs (rule counts, evolution steps) to allow readers to assess selection bias and scale.
- [Limitations] The large-grammar failure case (EAST-ADL) would benefit from a short error breakdown (e.g., which rule types failed) even if full analysis is deferred to future work.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us improve the clarity and reproducibility of the manuscript. We address each major comment below and have revised the paper to incorporate additional methodological details where needed.
- Referee: Evaluation section (test-set results): The headline claim of 100% adaptation consistency and output similarity on DOT and Xcore depends on the prompting strategies developed from the four training DSLs transferring without per-DSL adjustments. The manuscript provides no explicit prompt templates, example-selection procedure, or confirmation that the same prompts were used unchanged on the held-out test DSLs; without this, the performance gap versus the rule-based baseline cannot be attributed solely to LLM generalization rather than implicit tuning.
Authors: We agree that explicit documentation of the prompting strategies is essential for attributing the results to generalization. The original manuscript describes the high-level process of developing strategies on the training DSLs but does not include the concrete templates or selection procedure. In the revised version we have added a new subsection (4.2.1) that provides the full prompt templates, the example-selection algorithm (based on embedding similarity between metamodel change descriptions and prior grammar rules), and an explicit statement that these identical prompts were applied without modification to the held-out DOT and Xcore DSLs. This addition directly supports the claim that the 100% consistency reflects transfer rather than per-DSL tuning.
revision: yes
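The example-selection step mentioned in this response can be pictured with the sketch below, which ranks prior grammar rules by cosine similarity to a metamodel change description. It is only an illustration: the embedding model used in the simulated rebuttal is not named, so a toy bag-of-words vector stands in for it, and `embed`, `cosine`, and `select_examples` are hypothetical names.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_examples(change_description, prior_rules, k=2):
    """Return the k prior rules most similar to the change description."""
    query = embed(change_description)
    ranked = sorted(
        prior_rules, key=lambda rule: cosine(query, embed(rule)), reverse=True
    )
    return ranked[:k]


prior_rules = [
    "Node: 'node' name=ID;",
    "Edge: source=ID '->' target=ID;",
    "Graph: 'graph' name=ID nodes+=Node*;",
]
top = select_examples("attribute name added to class Node", prior_rules, k=2)
```

A real pipeline would swap `embed` for a learned sentence or code embedding; the ranking logic would stay the same.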
- Referee: QVTo longitudinal study: The claim that the LLM approach reused learned adaptations across all three evolution steps without manual grammar editing is central to demonstrating practical utility. However, the description lacks detail on the exact mechanism for carrying adaptations forward (e.g., whether prior outputs were appended to prompts or how metamodel changes were encoded), making it difficult to assess whether the result reflects robust co-evolution or case-specific prompting.
Authors: We acknowledge that the original description of the reuse mechanism was insufficiently detailed. We have expanded Section 5.3 with a precise account of the longitudinal prompting procedure: each new prompt contains (1) the immediately preceding grammar version, (2) the metamodel evolution encoded as a structured diff, and (3) the grammar rules adapted in the previous step as few-shot examples. No manual edits were performed between steps. We have also inserted pseudocode illustrating the incremental prompt construction. These additions allow readers to evaluate whether the observed reuse demonstrates robust co-evolution.
revision: yes
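The three-part prompt described in this response might be built along the lines of the sketch below. It is not the pseudocode the response refers to. It assumes a metamodel can be simplified to a mapping from class names to sets of feature names, and that the model call is a function returning the new grammar together with the rules it adapted. All names (`metamodel_diff`, `build_step_prompt`, `run_steps`, `llm`) are hypothetical.

```python
def metamodel_diff(old, new):
    """Structured diff between two metamodels given as {class_name: set_of_features}."""
    diff = []
    for cls in sorted(new.keys() - old.keys()):
        diff.append(f"ADD class {cls}")
    for cls in sorted(old.keys() - new.keys()):
        diff.append(f"REMOVE class {cls}")
    for cls in sorted(old.keys() & new.keys()):
        for feat in sorted(new[cls] - old[cls]):
            diff.append(f"ADD feature {cls}.{feat}")
        for feat in sorted(old[cls] - new[cls]):
            diff.append(f"REMOVE feature {cls}.{feat}")
    return diff


def build_step_prompt(previous_grammar, diff, adapted_rules):
    """Combine the three ingredients listed in the response into one prompt."""
    return "\n\n".join([
        "Previous grammar version:\n" + previous_grammar,
        "Metamodel evolution (structured diff):\n" + "\n".join(diff),
        "Rules adapted in the previous step (few-shot examples):\n"
        + "\n".join(adapted_rules),
        "Produce the updated grammar.",
    ])


def run_steps(metamodels, initial_grammar, initial_adapted_rules, llm):
    """Walk through consecutive metamodel versions, carrying outputs forward.

    llm is a hypothetical callable: prompt -> (new_grammar, adapted_rules).
    """
    grammar, adapted = initial_grammar, initial_adapted_rules
    history = []
    for old, new in zip(metamodels, metamodels[1:]):
        prompt = build_step_prompt(grammar, metamodel_diff(old, new), adapted)
        grammar, adapted = llm(prompt)  # no manual edits between steps
        history.append(grammar)
    return history
```

Four metamodel versions passed to `run_steps` produce three evolution steps, matching the shape of the QVTo study; the stub below shows the wiring without calling any real model.

```python
prompts = []

def stub_llm(prompt):
    prompts.append(prompt)
    return f"grammar v{len(prompts)}", [f"rule from step {len(prompts)}"]

versions = [{"A": {"x"}}, {"A": {"x", "y"}}, {"A": {"x", "y"}, "B": set()}]
history = run_steps(versions, "grammar v0", ["initial rule"], stub_llm)
```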
Circularity Check
No circularity: empirical results on held-out DSLs are independent of internal definitions
full rationale
The paper reports an empirical study that splits six DSLs into a training set of four (used only to develop prompting strategies) and a held-out test set of two, plus a separate longitudinal QVTo case study. No equations, fitted parameters, or self-definitional derivations appear in the provided text. Performance metrics (100% adaptation consistency on test DSLs) are measured directly against external baselines and ground-truth grammars rather than being computed from quantities defined inside the paper. The prompting strategies are described as developed on training data and applied to test data; even if transfer assumptions are debatable, this does not constitute a reduction of the reported outcome to its own inputs by construction. Self-citations, if present, are not load-bearing for the central empirical claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can generalize grammar adaptation rules from a small set of training examples to new metamodel versions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear.
We evaluated this approach on six real-world Xtext domain-specific languages... measured grammar adaptation quality from three dimensions: grammar rule-level adaptation consistency, output similarity, and metamodel conformance.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
About the authors
Weixing Zhang is a Postdoctoral researcher at Karlsruhe Institute of Technology. You can contact the author at weixing.zhang@kit.edu or visit https://wilson008.github.i...