
arxiv: 2605.15865 · v1 · pith:GGHTTB5Wnew · submitted 2026-05-15 · 💻 cs.SE

From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs

Pith reviewed 2026-05-20 16:33 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM evaluation · DSL generation · model-driven engineering · few-shot prompting · open-source models · grammar conformance · UI and data models · syntactic validity

The pith

Open LLMs as small as 7 billion parameters generate valid DSL models from natural language descriptions using few-shot prompting alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether open-source language models of sizes from 0.5B to 32B parameters can produce models that follow the rules of a domain-specific language when given only a few examples in a prompt. It moves beyond earlier tests on fixed data schemas to the harder case of creating both user interface models and data models from scratch, requiring the models to figure out domain relationships and keep the two models consistent with each other. This matters because successful small models would let teams run model-driven engineering tasks locally and cheaply instead of depending on large proprietary systems. The evaluation uses automatic grammar checks plus expert review on outputs from 39 models and finds that certain compact ones reach quality levels close to those of much bigger models on syntactic correctness, semantic completeness, and cross-model consistency.

Core claim

The authors demonstrate that open LLMs can generate DSL-conformant models from natural language using only few-shot prompting and no fine-tuning. By requiring the models to create both UI and data models entirely from scratch, the work tests their capacity to infer domain-specific relationships and preserve consistency across interconnected artifacts. Structured evaluation through parsing and expert feedback across 39 models shows that several compact models, such as gemma3:12b and mistral:7b-instruct, approach or match the performance of much larger models on the metrics of syntactic validity, semantic completeness, and inter-model reference consistency.

What carries the argument

Few-shot prompting applied to open LLMs to produce grammar-conformant, mutually consistent UI and data models evaluated on syntactic validity, semantic completeness, and reference consistency.
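
As an illustration of the mechanism, the following is a minimal sketch of how a few-shot prompt for DSL generation might be assembled. The `Example` class, the `build_few_shot_prompt` helper, the instruction wording, and the toy `entity` syntax are all assumptions made for illustration; the paper's actual prompt template and DSL grammar are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One demonstration pair: a description and the DSL text it maps to."""
    description: str
    dsl: str


def build_few_shot_prompt(grammar_hint: str, examples: list[Example], task: str) -> str:
    """Concatenate an instruction, the demonstrations, and the new task."""
    parts = [
        "Generate a model in the DSL below. Output only DSL text.",
        "Grammar summary:\n" + grammar_hint.strip(),
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\nDescription: {ex.description.strip()}\nModel:\n{ex.dsl.strip()}"
        )
    # The final block is left open so that the LLM completes the model.
    parts.append(f"Description: {task.strip()}\nModel:")
    return "\n\n".join(parts)


# Hypothetical toy DSL and demonstration, not the paper's grammar or examples.
EXAMPLES = [
    Example("A library that tracks books.", "entity Book { title: String }"),
]
PROMPT = build_few_shot_prompt(
    "entity NAME { FIELD: TYPE }", EXAMPLES, "An online ice cream shop."
)
```

The resulting string would be sent unchanged to each model under test, so that differences in output quality can be attributed to the model rather than to the prompt.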

If this is right

  • Teams can adopt smaller open models for DSL generation tasks without incurring the cost or latency of large proprietary models.
  • Model-driven engineering workflows become feasible in environments where local deployment and data privacy are required.
  • The same prompting approach generalizes across models that play different structural roles, such as UI versus data models.
  • No additional training is needed to achieve grammar-conformant output when a modest number of examples is supplied in the prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results open the possibility of embedding DSL generation directly into lightweight development tools that run on ordinary hardware.
  • Similar prompting strategies could be tested on other grammar-constrained generation tasks outside traditional MDE, such as configuration files or API schemas.
  • Future experiments might vary the number of examples or the complexity of the domain to map the point at which model size stops being the dominant factor.

Load-bearing premise

The chosen metrics of syntactic validity, semantic completeness, and inter-model reference consistency, together with the selected test cases and expert feedback, are sufficient to show practical utility for real-world model-driven engineering tasks.
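
To make the third metric concrete, here is a hedged sketch of one way inter-model reference consistency could be checked. It assumes that both generated models have already been parsed into simple Python structures; the `dangling_references` and `reference_consistency` helpers, the data layout, and the sample values are hypothetical and are not taken from the paper.

```python
def dangling_references(
    data_model: dict[str, set[str]],
    ui_bindings: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return UI bindings (entity, field) that the data model does not define."""
    missing = []
    for entity, field in ui_bindings:
        # An unknown entity yields an empty field set, so the binding is dangling.
        if field not in data_model.get(entity, set()):
            missing.append((entity, field))
    return missing


def reference_consistency(
    data_model: dict[str, set[str]],
    ui_bindings: list[tuple[str, str]],
) -> float:
    """Fraction of UI bindings that resolve; 1.0 when there are no bindings."""
    if not ui_bindings:
        return 1.0
    bad = len(dangling_references(data_model, ui_bindings))
    return 1.0 - bad / len(ui_bindings)


# Hypothetical parsed output for an ice cream shop description.
DATA_MODEL = {"Flavor": {"name", "price"}, "Order": {"total"}}
UI_BINDINGS = [
    ("Flavor", "name"),
    ("Order", "total"),
    ("Order", "customer"),  # field missing from the data model
    ("Topping", "name"),    # entity missing from the data model
]
```

With these sample values, two of the four bindings fail to resolve, so `reference_consistency(DATA_MODEL, UI_BINDINGS)` evaluates to 0.5.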

What would settle it

A new test set of domain descriptions in which compact models such as mistral:7b-instruct repeatedly produce outputs that fail automatic parsing or expert review for consistency between the generated UI and data models would falsify the central claim.
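
The automatic parsing step in such a test could look roughly like the sketch below. The paper reports a Lark-based parser (Figure 2); this example instead uses a small hand-written checker built only on the Python standard library, and the toy grammar (`entity NAME { FIELD: TYPE ... }`), the `DslSyntaxError` class, and the helper names are assumptions for illustration.

```python
import re

# A token is either an identifier or one of the punctuation marks { } :
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[{}:]")


class DslSyntaxError(Exception):
    """Syntax error carrying the character offset of the offending token."""

    def __init__(self, message: str, position: int):
        super().__init__(f"{message} at offset {position}")
        self.position = position


def tokenize(text: str) -> list[tuple[str, int]]:
    """Split text into (token, offset) pairs; reject unknown characters."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise DslSyntaxError("unexpected character", pos)
        tokens.append((match.group(), pos))
        pos = match.end()
    return tokens


def check_syntax(text: str) -> None:
    """Raise DslSyntaxError unless text is one or more entity blocks."""
    tokens = tokenize(text)
    i = 0

    def expect(kind: str) -> None:
        nonlocal i
        if i >= len(tokens):
            raise DslSyntaxError(f"expected {kind}, found end of input", len(text))
        tok, pos = tokens[i]
        is_name = tok[0].isalpha() or tok[0] == "_"
        ok = is_name if kind == "NAME" else tok == kind
        if not ok:
            raise DslSyntaxError(f"expected {kind}, found {tok!r}", pos)
        i += 1

    if not tokens:
        raise DslSyntaxError("empty model", 0)
    while i < len(tokens):
        expect("entity")
        expect("NAME")
        expect("{")
        # Zero or more "field: Type" declarations until the closing brace.
        while i < len(tokens) and tokens[i][0] != "}":
            expect("NAME")
            expect(":")
            expect("NAME")
        expect("}")


def is_valid(text: str) -> bool:
    """True when the text parses under the toy grammar."""
    try:
        check_syntax(text)
    except DslSyntaxError:
        return False
    return True


def validity_rate(outputs: list[str]) -> float:
    """Share of generated outputs that parse; expects a non-empty list."""
    return sum(is_valid(o) for o in outputs) / len(outputs)
```

Running `validity_rate` over repeated generations from one compact model on new domain descriptions would give the kind of repeated-failure signal described above, with the `position` attribute pointing at the token where each failing output breaks.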

Figures

Figures reproduced from arXiv: 2605.15865 by Cécilia Satrin, Didier Schwab, Junaid Baber, Léo Challier, Nicolas Hili.

Figure 1: Overview of the proposed model-based approach based on LLM for DSL creation.
Figure 2: Preview of LARK based parser implementation that identifies syntax errors with token level
Figure 3: Overview of the human evaluation interface. Panel (a) displays the available evaluation experiments,
Figure 4: Stacked bar chart of averaged evaluation scores for 26 DSL models designed for an online ice cream
Figure 5: Stacked bar chart of averaged evaluation scores for 18 DSL models with fewer than 8 billion
Original abstract

Large Language Models (LLMs) have shown increasing potential in automating model-driven software engineering tasks, particularly in generating models conforming to Domain Specific Languages (DSLs) from natural language. While most existing approaches rely on large proprietary models, their high cost and limited deployability hinder broader adoption. In this paper, we evaluate whether open-source LLMs of varying sizes (0.5B to 32B parameters) can generate DSL-conformant models using only few-shot prompting, without any fine-tuning. Our evaluation focuses on key model-driven engineering (MDE) requirements, including syntactic validity, semantic completeness, and inter-model reference consistency. We extend our prior work by moving from generating user interface models (referred to as "UI models" in this paper) over fixed, predefined data schemas ("data models") to generating both the UI and data models entirely from scratch. This shift serves two purposes: first, it highlights the LLM's ability to infer domain-specific relationships and maintain consistency across multiple interconnected models; second, it allows us to generalize earlier findings by testing DSL generation across models of different natures and structural roles. Our structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models. These findings demonstrate the feasibility of using smaller, open-source LLMs for grammar-conformant DSL generation in MDE workflows, offering a cost-effective and deployable alternative to closed LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates 39 open-source LLMs (0.5B–32B parameters) on generating DSL-conformant models from natural language via few-shot prompting, without fine-tuning. It measures syntactic validity, semantic completeness, and inter-model reference consistency while extending prior work from UI models over fixed data schemas to generating both UI and data models from scratch. The central finding is that compact models such as gemma3:12b and mistral:7b-instruct approach or match the quality of much larger models according to automatic parsing plus expert feedback.

Significance. If the evaluation is robust, the result would be significant for model-driven engineering by showing that smaller, deployable open LLMs can produce grammar-conformant DSL models at quality levels comparable to larger models. This would support cost-effective alternatives to proprietary LLMs and broaden practical adoption in MDE workflows. The broad coverage across 39 models and the shift to generating interconnected models from scratch are positive aspects of the study design.

major comments (2)
  1. [Abstract] The central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details the strength of the parity result cannot be assessed.
  2. [Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.
minor comments (2)
  1. Clarify the exact prompting templates and the domain complexity of the test cases so readers can judge representativeness.
  2. Add a non-LLM baseline (e.g., template-based or rule-based generator) to contextualize the absolute performance levels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate revisions that will be incorporated to improve the clarity and robustness of our claims.

  1. Referee: [Abstract] The central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details the strength of the parity result cannot be assessed.

    Authors: We agree that the abstract would benefit from additional quantitative details to allow readers to better evaluate the parity claim. In the revised manuscript, we will update the abstract to include the total number of test cases (across both UI and data model generations), summary performance figures such as syntactic validity percentages for the top compact models (e.g., gemma3:12b and mistral:7b-instruct) relative to larger models, and a brief reference to the expert evaluation process. If inter-rater reliability statistics were computed, they will be noted; otherwise we will clarify the expert review protocol. This change strengthens the abstract without misrepresenting the underlying results. revision: yes

  2. Referee: [Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.

    Authors: The referee correctly notes that we do not present direct empirical evidence linking our metrics to downstream usability outcomes such as tool import success or long-term maintainability. Syntactic validity and semantic completeness were chosen because they are necessary prerequisites for any practical MDE application, and expert feedback provides a domain-informed proxy for completeness. However, we acknowledge the absence of explicit correlation studies. We will add a dedicated paragraph in the Discussion section that explicitly states this limitation, explains the rationale for the selected metrics, and identifies end-to-end usability evaluation as valuable future work. This revision addresses the concern transparently while preserving the scope of the current study. revision: partial
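
The inter-rater reliability statistic raised in the first comment could, for two expert reviewers, be reported as Cohen's kappa. The sketch below is one common way to compute it and is offered only as an illustration: the page does not say which statistic, if any, the authors computed, and the `cohen_kappa` helper and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts from two reviewers on four generated models.
REVIEWER_A = ["ok", "ok", "bad", "ok"]
REVIEWER_B = ["ok", "bad", "bad", "ok"]
```

For the sample verdicts, observed agreement is 0.75 and chance agreement is 0.5, so `cohen_kappa(REVIEWER_A, REVIEWER_B)` gives 0.5.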

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations or self-referential constructions


The paper is an empirical evaluation study that measures syntactic validity, semantic completeness, and inter-model reference consistency of LLM-generated DSL models via automatic parsing and expert feedback on a fixed set of test cases. No equations, fitted parameters, predictions, or derivations are present. The brief reference to extending prior work is purely contextual and does not serve as load-bearing justification for the central claims, which rest on direct experimental measurements rather than any self-citation chain or definitional reduction. The findings are therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As an empirical evaluation there are no free parameters, invented entities, or mathematical axioms. The work rests on the domain assumption that few-shot prompting plus standard parsing and expert review can measure LLM capability for DSL generation.

axioms (1)
  • domain assumption: Few-shot prompting without fine-tuning is sufficient for open LLMs to produce DSL-conformant models that meet syntactic, semantic, and consistency requirements.
    This premise underpins the entire experimental design described in the abstract.



