From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs

Cécilia Satrin; Didier Schwab; Junaid Baber; Léo Challier; Nicolas Hili

arxiv: 2605.15865 · v1 · pith:GGHTTB5Wnew · submitted 2026-05-15 · 💻 cs.SE

Pith reviewed 2026-05-20 16:33 UTC · model grok-4.3

keywords LLM evaluation · DSL generation · model-driven engineering · few-shot prompting · open-source models · grammar conformance · UI and data models · syntactic validity

The pith

Open LLMs as small as 7 billion parameters generate valid DSL models from natural language descriptions using few-shot prompting alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether open-source language models of sizes from 0.5B to 32B parameters can produce models that follow the rules of a domain-specific language when given only a few examples in a prompt. It moves beyond earlier tests on fixed data schemas to the harder case of creating both user interface models and data models from scratch, requiring the models to figure out domain relationships and keep the two models consistent with each other. This matters because successful small models would let teams run model-driven engineering tasks locally and cheaply instead of depending on large proprietary systems. The evaluation uses automatic grammar checks plus expert review on outputs from 39 models and finds that certain compact ones reach quality levels close to those of much bigger models on syntactic correctness, semantic completeness, and cross-model consistency.

Core claim

The authors demonstrate that open LLMs can generate DSL-conformant models from natural language using only few-shot prompting and no fine-tuning. By requiring the models to create both UI and data models entirely from scratch, the work tests their capacity to infer domain-specific relationships and preserve consistency across interconnected artifacts. Structured evaluation through parsing and expert feedback across 39 models shows that several compact models, such as gemma3:12b and mistral:7b-instruct, approach or match the performance of much larger models on the metrics of syntactic validity, semantic completeness, and inter-model reference consistency.

What carries the argument

Few-shot prompting applied to open LLMs to produce grammar-conformant, mutually consistent UI and data models evaluated on syntactic validity, semantic completeness, and reference consistency.
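
As a rough illustration of what few-shot prompting means operationally, the sketch below assembles a prompt from a handful of worked examples and a new description. It is not the authors' template, which the abstract does not give. The `Example` class, the `build_few_shot_prompt` function, the section markers, the toy DSL snippets, and the choice to paste the grammar text into the prompt are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Example:
    """One worked example shown to the LLM (hypothetical structure)."""
    description: str  # natural-language requirement
    data_model: str   # hand-written data model in the target DSL
    ui_model: str     # hand-written UI model that refers to the data model


def build_few_shot_prompt(grammar: str, examples: Sequence[Example], task: str) -> str:
    """Concatenate grammar, worked examples and the new task into one prompt.

    The prompt ends on an open "Data model:" line so that the model is
    nudged to continue with the data model first and the UI model after it.
    """
    parts = ["You write models in the DSL defined by this grammar:", grammar.strip(), ""]
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"### Example {i}",
            f"Description: {ex.description}",
            "Data model:",
            ex.data_model.strip(),
            "UI model:",
            ex.ui_model.strip(),
            "",
        ]
    parts += ["### Task", f"Description: {task}", "Data model:"]
    return "\n".join(parts)


# Toy inputs; the syntax is invented for this sketch, not taken from the paper.
EXAMPLES = [
    Example(
        description="A library tracks books.",
        data_model="entity Book { title: String; }",
        ui_model="page Books { list bind Book.title; }",
    )
]
PROMPT = build_few_shot_prompt("entity NAME { NAME: NAME; }", EXAMPLES, "An online ice cream shop.")
```

The resulting string would be sent unchanged to each model under test, so any difference in output quality comes from the model rather than from per-model prompt tuning.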

If this is right

Teams can adopt smaller open models for DSL generation tasks without incurring the cost or latency of large proprietary models.
Model-driven engineering workflows become feasible in environments where local deployment and data privacy are required.
The same prompting approach generalizes across models that play different structural roles, such as UI versus data models.
No additional training is needed to achieve grammar-conformant output when a modest number of examples is supplied in the prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The results open the possibility of embedding DSL generation directly into lightweight development tools that run on ordinary hardware.
Similar prompting strategies could be tested on other grammar-constrained generation tasks outside traditional MDE, such as configuration files or API schemas.
Future experiments might vary the number of examples or the complexity of the domain to map the point at which model size stops being the dominant factor.

Load-bearing premise

The chosen metrics of syntactic validity, semantic completeness, and inter-model reference consistency, together with the selected test cases and expert feedback, are sufficient to show practical utility for real-world model-driven engineering tasks.

What would settle it

A new test set of domain descriptions in which compact models such as mistral:7b-instruct repeatedly produce outputs that fail automatic parsing or expert review for consistency between the generated UI and data models would falsify the central claim.
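
The consistency half of such a test can be automated in principle. The sketch below assumes a hypothetical syntax in which the data model declares `entity Name { field: Type; }` blocks and the UI model refers to them with `bind Entity.field`; neither form is taken from the paper, and the function names are invented here. It lists every UI reference that has no matching declaration.

```python
import re
from typing import Dict, List, Set, Tuple

# Patterns for the invented toy syntax described in the lead-in.
ENTITY_RE = re.compile(r"entity\s+(\w+)\s*\{([^}]*)\}")
FIELD_RE = re.compile(r"(\w+)\s*:\s*\w+\s*;")
BIND_RE = re.compile(r"bind\s+(\w+)\.(\w+)")


def declared_fields(data_model: str) -> Dict[str, Set[str]]:
    """Map each entity name to the set of field names it declares."""
    return {name: set(FIELD_RE.findall(body)) for name, body in ENTITY_RE.findall(data_model)}


def dangling_references(data_model: str, ui_model: str) -> List[Tuple[str, str]]:
    """Return (entity, field) pairs used in the UI model but never declared."""
    declared = declared_fields(data_model)
    return [
        (entity, field)
        for entity, field in BIND_RE.findall(ui_model)
        if field not in declared.get(entity, set())
    ]


def reference_consistency(data_model: str, ui_model: str) -> float:
    """Fraction of UI references that resolve; 1.0 when there are none."""
    total = len(BIND_RE.findall(ui_model))
    if total == 0:
        return 1.0
    return 1.0 - len(dangling_references(data_model, ui_model)) / total


DATA = "entity Flavor { name: String; price: Float; }"
UI = "page Menu { label bind Flavor.name; label bind Flavor.colour; input bind Order.qty; }"
```

With these toy inputs, `Flavor.colour` and `Order.qty` are reported as dangling. A compact model that repeatedly produced such pairs on fresh domain descriptions would count against the central claim.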

Figures

Figures reproduced from arXiv: 2605.15865 by Cécilia Satrin, Didier Schwab, Junaid Baber, Léo Challier, Nicolas Hili.

Figure 2: Preview of LARK based parser implementation that identifies syntax errors with token level …

Figure 3: Overview of the human evaluation interface. Panel (a) displays the available evaluation experiments, …

Figure 4: Stacked bar chart of averaged evaluation scores for 26 DSL models designed for an online ice cream …

Figure 5: Stacked bar chart of averaged evaluation scores for 18 DSL models with fewer than 8 billion …
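
Figure 2 refers to a Lark-based parser that reports syntax errors at token level. The grammar is not reproduced on this page, so the following is a stand-in written with the Python standard library only. It uses neither Lark nor the authors' grammar: it checks a toy `entity Name { field: Type; }` syntax and reports the line, column and expected token of the first error. All names (`tokenize`, `check_syntax`, `DslSyntaxError`) are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    kind: str
    text: str
    line: int
    col: int


TOKEN_RE = re.compile(
    r"(?P<WS>[ \t\r]+)|(?P<NL>\n)|(?P<NAME>[A-Za-z_]\w*)|(?P<PUNCT>[{}:;])|(?P<BAD>.)"
)


class DslSyntaxError(Exception):
    """Carries the offending token and what the checker expected instead."""

    def __init__(self, token: Token, expected: str) -> None:
        super().__init__(
            f"line {token.line}, col {token.col}: expected {expected}, got {token.text!r}"
        )
        self.token = token
        self.expected = expected


def tokenize(src: str) -> List[Token]:
    """Split the source into tokens, tracking 1-based line and column."""
    tokens, line, line_start = [], 1, 0
    for m in TOKEN_RE.finditer(src):
        kind = m.lastgroup
        if kind == "NL":
            line, line_start = line + 1, m.end()
        elif kind != "WS":
            tokens.append(Token(kind, m.group(), line, m.start() - line_start + 1))
    tokens.append(Token("EOF", "", line, len(src) - line_start + 1))
    return tokens


def check_syntax(src: str) -> Optional[DslSyntaxError]:
    """Return None if src matches the toy grammar, else the first error."""
    toks, pos = tokenize(src), 0

    def expect(kind: str, text: Optional[str] = None) -> None:
        nonlocal pos
        tok = toks[pos]
        if tok.kind != kind or (text is not None and tok.text != text):
            raise DslSyntaxError(tok, text or kind)
        pos += 1

    try:
        while toks[pos].kind != "EOF":
            expect("NAME", "entity")
            expect("NAME")
            expect("PUNCT", "{")
            while toks[pos].text != "}":
                # One field declaration: name ':' type ';'
                expect("NAME")
                expect("PUNCT", ":")
                expect("NAME")
                expect("PUNCT", ";")
            expect("PUNCT", "}")
    except DslSyntaxError as err:
        return err
    return None
```

Under this scheme the syntactic validity of a generated model is simply whether `check_syntax` returns `None`, and the error position gives an evaluator something concrete to inspect when it does not.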

Abstract

Large Language Models (LLMs) have shown increasing potential in automating model-driven software engineering tasks, particularly in generating models conforming to Domain Specific Languages (DSLs) from natural language. While most existing approaches rely on large proprietary models, their high cost and limited deployability hinder broader adoption. In this paper, we evaluate whether open-source LLMs of varying sizes (0.5B to 32B parameters) can generate DSL-conformant models using only few-shot prompting, without any fine-tuning. Our evaluation focuses on key model-driven engineering (MDE) requirements, including syntactic validity, semantic completeness, and inter-model reference consistency. We extend our prior work by moving from generating user interface models (referred to as "UI models" in this paper) over fixed, predefined data schemas ("data models") to generating both the UI and data models entirely from scratch. This shift serves two purposes: first, it highlights the LLM's ability to infer domain-specific relationships and maintain consistency across multiple interconnected models; second, it allows us to generalize earlier findings by testing DSL generation across models of different natures and structural roles. Our structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models. These findings demonstrate the feasibility of using smaller, open-source LLMs for grammar-conformant DSL generation in MDE workflows, offering a cost-effective and deployable alternative to closed LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Compact open LLMs can generate consistent DSL models for UI and data from scratch in the authors' tests, but missing experimental details make it hard to judge how far that extends to real MDE work.

The key point is that this evaluation finds several smaller open models, such as gemma3:12b and mistral:7b-instruct, producing DSL outputs that match or come close to larger ones on syntactic validity, semantic completeness, and cross-model consistency when using few-shot prompts only. The work extends the authors' earlier results by dropping the fixed data schema and instead letting the LLM generate both the UI model and the supporting data model from natural language descriptions. That change tests whether the model can infer domain relationships and keep the two models aligned, which is a reasonable next step for MDE applications.

The setup itself is straightforward: automatic parsing for grammar conformance plus expert review on a set of cases, run across 39 models ranging from 0.5B to 32B parameters. This gives a practical comparison focused on open, deployable options rather than proprietary ones, which directly addresses cost and accessibility concerns in the field. The findings are new in the sense that they report results for this from-scratch dual-model generation task, which was not covered in the prior literature they cite.

The main limitation is the thin reporting. The abstract gives no per-model score tables, no count of test cases, no inter-rater numbers for the expert feedback, and no comparison against a simple non-LLM baseline. Without those, it is difficult to tell whether the observed parity holds only for the chosen, relatively contained scenarios or would survive more complex domain models. The stress-test concern about metric-to-usability correlation is fair here; syntactic and semantic scores are necessary but do not automatically show that the generated models would integrate cleanly into existing modeling tools or stay maintainable.

This paper is aimed at researchers working on LLM support for model-driven engineering who care about open-source alternatives. A reader already familiar with prompting techniques for code or model generation will get the most out of it. The empirical framing and the extension to dual-model generation are solid enough to warrant a full review, even if the current draft needs clearer methods and results sections. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper evaluates 39 open-source LLMs (0.5B–32B parameters) on generating DSL-conformant models from natural language via few-shot prompting, without fine-tuning. It measures syntactic validity, semantic completeness, and inter-model reference consistency while extending prior work from UI models over fixed data schemas to generating both UI and data models from scratch. The central finding is that compact models such as gemma3:12b and mistral:7b-instruct approach or match the quality of much larger models according to automatic parsing plus expert feedback.

Significance. If the evaluation is robust, the result would be significant for model-driven engineering by showing that smaller, deployable open LLMs can produce grammar-conformant DSL models at quality levels comparable to larger models. This would support cost-effective alternatives to proprietary LLMs and broaden practical adoption in MDE workflows. The broad coverage across 39 models and the shift to generating interconnected models from scratch are positive aspects of the study design.

major comments (2)

[Abstract] Abstract: the central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details the strength of the parity result cannot be assessed.
[Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.
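
For readers unfamiliar with the inter-rater reliability the first comment asks for, one common statistic is Cohen's kappa, which corrects raw agreement between two raters for the agreement expected by chance. Whether the authors computed this or any other agreement statistic is not stated; the sketch below, with a hypothetical `cohens_kappa` function and made-up labels, only shows what such a number would look like for two experts labelling the same generated models.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four generated models; not data from the paper.
EXPERT_1 = ["ok", "ok", "bad", "bad"]
EXPERT_2 = ["ok", "bad", "bad", "bad"]
KAPPA = cohens_kappa(EXPERT_1, EXPERT_2)
```

Here the two experts agree on three of four items (0.75) against a chance agreement of 0.5, giving a kappa of 0.5. Reporting a figure of this kind alongside the expert scores would let readers judge how much weight the human evaluation can bear.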

minor comments (2)

Clarify the exact prompting templates and the domain complexity of the test cases so readers can judge representativeness.
Add a non-LLM baseline (e.g., template-based or rule-based generator) to contextualize the absolute performance levels.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and indicate revisions that will be incorporated to improve the clarity and robustness of our claims.

Referee: [Abstract] Abstract: the central claim that compact models approach or match larger ones rests on automatic parsing and expert feedback, yet the abstract supplies no quantitative breakdown (per-model scores, number of test cases, inter-rater reliability, or baseline comparisons). Without these details the strength of the parity result cannot be assessed.

Authors: We agree that the abstract would benefit from additional quantitative details to allow readers to better evaluate the parity claim. In the revised manuscript, we will update the abstract to include the total number of test cases (across both UI and data model generations), summary performance figures such as syntactic validity percentages for the top compact models (e.g., gemma3:12b and mistral:7b-instruct) relative to larger models, and a brief reference to the expert evaluation process. If inter-rater reliability statistics were computed, they will be noted; otherwise we will clarify the expert review protocol. This change strengthens the abstract without misrepresenting the underlying results.

revision: yes

Referee: [Evaluation] The evaluation section (and associated tables/figures): the chosen metrics and fixed test cases are asserted to demonstrate practical utility for MDE, but no evidence is given that syntactic validity and semantic completeness correlate with downstream usability (e.g., successful import into modeling tools or maintainability). This assumption is load-bearing for the broader conclusion about cost-effective workflows.

Authors: The referee correctly notes that we do not present direct empirical evidence linking our metrics to downstream usability outcomes such as tool import success or long-term maintainability. Syntactic validity and semantic completeness were chosen because they are necessary prerequisites for any practical MDE application, and expert feedback provides a domain-informed proxy for completeness. However, we acknowledge the absence of explicit correlation studies. We will add a dedicated paragraph in the Discussion section that explicitly states this limitation, explains the rationale for the selected metrics, and identifies end-to-end usability evaluation as valuable future work. This revision addresses the concern transparently while preserving the scope of the current study.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation without derivations or self-referential constructions

The paper is an empirical evaluation study that measures syntactic validity, semantic completeness, and inter-model reference consistency of LLM-generated DSL models via automatic parsing and expert feedback on a fixed set of test cases. No equations, fitted parameters, predictions, or derivations are present. The brief reference to extending prior work is purely contextual and does not serve as load-bearing justification for the central claims, which rest on direct experimental measurements rather than any self-citation chain or definitional reduction. The findings are therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As an empirical evaluation there are no free parameters, invented entities, or mathematical axioms. The work rests on the domain assumption that few-shot prompting plus standard parsing and expert review can measure LLM capability for DSL generation.

axioms (1)

domain assumption: Few-shot prompting without fine-tuning is sufficient for open LLMs to produce DSL-conformant models that meet syntactic, semantic, and consistency requirements.
This premise underpins the entire experimental design described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "structured evaluation combines automatic parsing and expert feedback across 39 LLMs, revealing that several compact models (e.g., gemma3:12b, mistral:7b-instruct) approach or match the quality of much larger models"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "syntactic validity, semantic completeness, and inter-model reference consistency"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 15 internal anchors

[1] Jules White, Quchen Fu, Sam Hays, et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382, 2023.
[2] Daniel Busch, Gerrit Nolte, Alexander Bainczyk, and Bernhard Steffen. ChatGPT in the loop: A natural language extension for domain-specific modeling languages. In Bridging the Gap between AI and Reality, pages 375–390. Springer, 2023.
[3] Aaron Grattafiori et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[4] Nicolas Hili and Raquel Araujo de Oliveira. A light-weight low-code platform for back-end automation. In MODELS ’22 Companion, pages 837–846. ACM, 2022.
[5] Nathan Hagel, Nicolas Hili, and Didier Schwab. Turning low-code development platforms into true no-code with LLMs. In MODELS Companion ’24, 2024.
[6] Albert Q. Jiang et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[7] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yong Zhuang, Zi Lin, Zheng Li, Dacheng Li, Eric P Xing, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
[8] Jiongxiao Wang, Zichen Liu, Keun Hee Park, Zhuojun Jiang, Zhaoheng Zheng, Zhuofeng Wu, Muhao Chen, and Chaowei Xiao. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023.
[9] Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, and Xing Xie. Dyval: Dynamic evaluation of large language models for reasoning tasks. arXiv preprint arXiv:2309.17167, 2024.
[10] Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, and Jiaya Jia. Dynamicbench: Evaluating real-time report generation in large language models. arXiv preprint arXiv:2506.21343, 2025.
[11] Haidar Khan, Hisham A Alyahya, Yazeed Alnumay, M Saiful Bari, and Bülent Yener. Zerosumeval: Scaling LLM evaluation with inter-model competition. arXiv preprint arXiv:2504.12562, 2025.
[12] Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, and Nadia Polikarpova. HYSYNTH: Context-free LLM approximation for guiding program synthesis. arXiv preprint arXiv:2405.15880, 2024.
[13] Victor Lamas, Miguel R. Luaces, and Daniel Garcia-Gonzalez. DSLXpert: LLM-driven generic DSL code generation. In Proceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems, 2024.
[14] Mohammad Jalili Torkamani. Kajal: Extracting grammar of a source code using large language models. arXiv preprint arXiv:2412.08842, 2024.
[15] Andrew D. White, Glen M. Hocky, Heta A. Gandhi, Mehrad Ansari, Sam Cox, Geemi P. Wellawatte, Subarna Sasmal, Ziyue Yang, Kangxin Liu, Yuvraj Singh, and Willmor J. Peña Ccoa. Assessment of chemistry knowledge in large language models that generate code. Digital Discovery, 2(2), 2023.
[16] Emilio Parisotto, Abdel rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855, 2016.
[17] Finnian Westenfelder, Erik Hemberg, Miguel Tulla, Stephen Moskal, Una-May O’Reilly, and Silviu Chiricescu. LLM-supported natural language to Bash translation. arXiv preprint arXiv:2502.06858, 2025.
[18] Sergio Morales, Robert Clarisó, and Jordi Cabot. Langbite: An open-source platform to automate bias testing of large language models. SoftwareX, 2025.
[19] Kyla H. Levin, Kyle Gwilt, Emery D. Berger, and Stephen N. Freund. Effective LLM-driven code generation with pythoness. arXiv preprint arXiv:2501.02138, 2025.
[20] IBM. Granite: Enterprise-ready foundation models. https://www.ibm.com/granite, 2024.
[21] Dirk Groeneveld et al. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
[22] Cognitive Computations. Dolphin 3.0 r1 mistral 24b, 2025.
[23] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024.
[24] Baptiste Rozière et al. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
[25] Nikhil Pinnaparaju et al. Stable code technical report. arXiv preprint arXiv:2404.01226, 2024.
[26] Etash Guha et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.
[27] Binyuan Hui et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
[28] Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, 2024. https://qwenlm.github.io/blog/qwq-32b-preview/
[29] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[30] Marah Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024.
[31] Marah Abdin et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2025.
[32] Gemma Team. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[33] Gemma Team. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.
