From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs
Pith reviewed 2026-05-20 16:33 UTC · model grok-4.3
The pith
Open LLMs as small as 7 billion parameters generate valid DSL models from natural language descriptions using few-shot prompting alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that open LLMs can generate DSL-conformant models from natural language using only few-shot prompting and no fine-tuning. By requiring the models to create both UI and data models entirely from scratch, the work tests their capacity to infer domain-specific relationships and preserve consistency across interconnected artifacts. Structured evaluation through parsing and expert feedback across 39 models shows that several compact models, such as gemma3:12b and mistral:7b-instruct, approach or match the performance of much larger models on the metrics of syntactic validity, semantic completeness, and inter-model reference consistency.
What carries the argument
Few-shot prompting applied to open LLMs to produce grammar-conformant, mutually consistent UI and data models evaluated on syntactic validity, semantic completeness, and reference consistency.
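As a rough illustration of what few-shot prompting for DSL generation can look like, the sketch below assembles a prompt from a grammar excerpt, a worked example, and a new description. The toy grammar, the example pair, and the names Example and build_few_shot_prompt are assumptions made for illustration; they are not the templates or the DSL used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One few-shot demonstration: a description paired with a DSL model."""
    description: str
    dsl_model: str


def build_few_shot_prompt(grammar: str, examples: list[Example], task: str) -> str:
    """Assemble a prompt from a grammar excerpt, worked examples, and the new task."""
    parts = ["Generate a model that conforms to this grammar:", grammar.strip(), ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            f"Description: {ex.description}",
            "Model:",
            ex.dsl_model.strip(),
            "",
        ]
    # The final block is left open so the LLM completes the model itself.
    parts += [f"Description: {task}", "Model:"]
    return "\n".join(parts)


# Hypothetical toy grammar and example; not the DSL evaluated in the paper.
TOY_GRAMMAR = "entity NAME { FIELD: TYPE, ... }"
EXAMPLES = [Example("A library tracks books by title.", "entity Book { title: string }")]

prompt = build_few_shot_prompt(TOY_GRAMMAR, EXAMPLES, "A clinic tracks patients by name.")
print(prompt)
```

The resulting string would be sent unchanged to each model under test, so any difference in output quality comes from the model rather than from the prompt.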
If this is right
- Teams can adopt smaller open models for DSL generation tasks without incurring the cost or latency of large proprietary models.
- Model-driven engineering workflows become feasible in environments where local deployment and data privacy are required.
- The same prompting approach generalizes across models that play different structural roles, such as UI versus data models.
- No additional training is needed to achieve grammar-conformant output when a modest number of examples is supplied in the prompt.
Where Pith is reading between the lines
- The results open the possibility of embedding DSL generation directly into lightweight development tools that run on ordinary hardware.
- Similar prompting strategies could be tested on other grammar-constrained generation tasks outside traditional MDE, such as configuration files or API schemas.
- Future experiments might vary the number of examples or the complexity of the domain to map the point at which model size stops being the dominant factor.
Load-bearing premise
The chosen metrics of syntactic validity, semantic completeness, and inter-model reference consistency, together with the selected test cases and expert feedback, are sufficient to show practical utility for real-world model-driven engineering tasks.
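To make the first metric concrete, here is a minimal sketch of a syntactic validity rate: the fraction of generated outputs that a parser accepts. The regex-based toy_parses function stands in for a real grammar-based parser, and both the toy grammar and the sample outputs are invented for illustration; none of them come from the paper.

```python
import re
from typing import Callable

# Hypothetical toy grammar: `entity Name { field: type, ... }`. Not the paper's DSL.
_ENTITY = re.compile(
    r"entity\s+[A-Z]\w*\s*\{\s*\w+\s*:\s*\w+(\s*,\s*\w+\s*:\s*\w+)*\s*\}"
)


def toy_parses(text: str) -> bool:
    """Stand-in for a real parser: accept exactly one well-formed entity."""
    return _ENTITY.fullmatch(text.strip()) is not None


def validity_rate(outputs: list[str], parses: Callable[[str], bool]) -> float:
    """Fraction of generated outputs that the parser accepts."""
    if not outputs:
        return 0.0
    return sum(1 for out in outputs if parses(out)) / len(outputs)


# Made-up outputs for illustration only; these are not results from the paper.
outputs = [
    "entity Patient { name: string, age: int }",
    "entity Patient { name string }",          # missing colon
    "entity Visit { date: string }",
    "Here is your model: entity X { a: b }",   # chatty preamble breaks parsing
]
rate = validity_rate(outputs, toy_parses)
print(f"{rate:.2f}")  # 0.50
```

A rate like this says nothing about whether a parsed model is complete or useful, which is why the premise above also leans on semantic completeness and expert feedback.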
What would settle it
A new test set of domain descriptions in which compact models such as mistral:7b-instruct repeatedly produce outputs that fail automatic parsing or expert review for consistency between the generated UI and data models would falsify the central claim.
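One way such a consistency failure could be detected automatically is sketched below: every attribute a UI model binds to must exist in the generated data model. The dictionary representation and the function name dangling_references are assumptions for illustration; the paper's own checks may work differently.

```python
def dangling_references(
    data_model: dict[str, set[str]],
    ui_bindings: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return UI bindings (entity, attribute) that the data model does not define."""
    return [
        (entity, attr)
        for entity, attr in ui_bindings
        # An unknown entity yields an empty attribute set, so it is also flagged.
        if attr not in data_model.get(entity, set())
    ]


# Hypothetical parsed models; a real pipeline would obtain these from the DSL parser.
data_model = {"Patient": {"name", "birth_date"}}
ui_bindings = [("Patient", "name"), ("Patient", "age"), ("Doctor", "name")]

broken = dangling_references(data_model, ui_bindings)
print(broken)  # [('Patient', 'age'), ('Doctor', 'name')]
```

An empty result means every UI reference resolves; a repeatedly non-empty result for compact models on new domain descriptions would be the kind of evidence described above.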
Original abstract
Large Language Models (LLMs) have shown increasing potential in automating model-driven software engineering tasks, particularly in generating models conforming to Domain Specific Languages (DSLs) from natural language. While most existing approaches rely on large proprietary models, their high cost and limited deployability hinder broader adoption. In this paper, we evaluate whether open-source LLMs of varying sizes (0.5B to 32B parameters) can generate DSL-conformant models using only few-shot prompting, without any fine-tuning. Our evaluation focuses on key model-driven engineering (MDE) requirements, including syntactic validity, semantic completeness, and inter-model reference consistency. We extend our prior work by moving from generating user interface models (referred to as "UI models" in this paper) over fixed, predefined data schemas ("data models") to generating both the UI and data models entirely from scratch. This shift serves two purposes: first, it highlights the LLM's ability to infer domain-specific relationships and maintain consistency across multiple interconnected models; second, it allows us to generalize earlier findings by testing DSL generation across models of different natures and structural roles. Our structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models. These findings demonstrate the feasibility of using smaller, open-source LLMs for grammar-conformant DSL generation in MDE workflows, offering a cost-effective and deployable alternative to closed LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates 39 open-source LLMs (0.5B–32B parameters) on generating DSL-conformant models from natural language via few-shot prompting, without fine-tuning. It measures syntactic validity, semantic completeness, and inter-model reference consistency while extending prior work from UI models over fixed data schemas to generating both UI and data models from scratch. The central finding is that compact models such as gemma3:12b and mistral:7b-instruct approach or match the quality of much larger models according to automatic parsing plus expert feedback.
Significance. If the evaluation is robust, the result would be significant for model-driven engineering by showing that smaller, deployable open LLMs can produce grammar-conformant DSL models at quality levels comparable to larger models. This would support cost-effective alternatives to proprietary LLMs and broaden practical adoption in MDE workflows. The broad coverage across 39 models and the shift to generating interconnected models from scratch are positive aspects of the study design.
major comments (2)
- [Abstract] The central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details, the strength of the parity result cannot be assessed.
- [Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.
minor comments (2)
- Clarify the exact prompting templates and the domain complexity of the test cases so readers can judge representativeness.
- Add a non-LLM baseline (e.g., template-based or rule-based generator) to contextualize the absolute performance levels.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate revisions that will be incorporated to improve the clarity and robustness of our claims.
- Referee: [Abstract] The central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details, the strength of the parity result cannot be assessed.
  Authors: We agree that the abstract would benefit from additional quantitative details to allow readers to better evaluate the parity claim. In the revised manuscript, we will update the abstract to include the total number of test cases (across both UI and data model generations), summary performance figures such as syntactic validity percentages for the top compact models (e.g., gemma3:12b and mistral:7b-instruct) relative to larger models, and a brief reference to the expert evaluation process. If inter-rater reliability statistics were computed, they will be noted; otherwise we will clarify the expert review protocol. This change strengthens the abstract without misrepresenting the underlying results.
  Revision: yes
- Referee: [Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.
  Authors: The referee correctly notes that we do not present direct empirical evidence linking our metrics to downstream usability outcomes such as tool import success or long-term maintainability. Syntactic validity and semantic completeness were chosen because they are necessary prerequisites for any practical MDE application, and expert feedback provides a domain-informed proxy for completeness. However, we acknowledge the absence of explicit correlation studies. We will add a dedicated paragraph in the Discussion section that explicitly states this limitation, explains the rationale for the selected metrics, and identifies end-to-end usability evaluation as valuable future work. This revision addresses the concern transparently while preserving the scope of the current study.
  Revision: partial
Circularity Check
No circularity: purely empirical evaluation without derivations or self-referential constructions
The paper is an empirical evaluation study that measures syntactic validity, semantic completeness, and inter-model reference consistency of LLM-generated DSL models via automatic parsing and expert feedback on a fixed set of test cases. No equations, fitted parameters, predictions, or derivations are present. The brief reference to extending prior work is purely contextual and does not serve as load-bearing justification for the central claims, which rest on direct experimental measurements rather than any self-citation chain or definitional reduction. The findings are therefore self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Few-shot prompting without fine-tuning is sufficient for open LLMs to produce DSL-conformant models that meet syntactic, semantic, and consistency requirements.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "syntactic validity, semantic completeness, and inter-model reference consistency"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.