Recognition: unknown
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Pith reviewed 2026-05-08 02:44 UTC · model grok-4.3
The pith
Fine-tuning code LLMs on path-preserving JSON encodings of DSL repositories produces multi-file outputs with exact structural fidelity of 1.00.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that encoding DSL folder hierarchies as structured, path-preserving JSON enables single-response generation of multi-file changes from natural-language instructions. When Qwen2.5-Coder and DeepSeek-Coder (7B) are adapted via QLoRA fine-tuning on such data, they reach high exact-match accuracy, substantial edit similarity, and repository structural fidelity of exactly 1.00 on the held-out set, while one-shot in-context learning yields smaller but consistent gains over baseline prompting.
What carries the argument
Encoding DSL folder hierarchies as structured, path-preserving JSON that lets a single model response produce complete multi-file repository outputs and learn cross-file dependencies.
If this is right
- One-shot in-context learning still improves accuracy over plain prompting across both models and all metrics.
- Outputs from the fine-tuned models pass an execution-based validation using the existing DSL-to-Java/TypeScript code generator.
- An expert developer survey confirms the practical usefulness of the generated multi-file artifacts.
- The dataset-construction and evaluation pipeline can be reused for other repository-scale DSL or configuration tasks.
- Structural fidelity of 1.00 indicates that the JSON representation successfully captures the folder and file layout required by the downstream generator.
Where Pith is reading between the lines
- The JSON encoding technique could be tested on non-DSL hierarchical codebases such as large configuration or build repositories.
- Fine-tuning on domain-specific industrial data may reduce the need for elaborate prompt engineering when targeting narrow languages.
- The custom structural-fidelity metric could serve as a reusable benchmark component for other multi-file code-generation studies.
- Integration of the same JSON format into IDE plugins might allow real-time generation of consistent multi-file edits during development.
Load-bearing premise
The held-out test set and custom metrics for edit correctness and structural fidelity adequately represent real industrial multi-file DSL tasks without bias introduced by how the dataset was built or represented as JSON.
What would settle it
A fresh collection of multi-file change requests drawn directly from ongoing BMW developer workflows, where the fine-tuned models produce outputs whose structural fidelity falls below 0.9 when checked against the actual repository structure.
Figures
read the original abstract
Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an industrial case study at BMW adapting two 7B code LLMs (Qwen2.5-Coder and DeepSeek-Coder) to generate and edit multi-file Xtext DSL artifacts from natural-language instructions. The core pipeline encodes repository folder hierarchies and file paths as structured, path-preserving JSON to enable single-response repository-scale outputs, then compares baseline prompting, one-shot in-context learning, and QLoRA fine-tuning. Evaluation employs standard similarity metrics together with custom measures of edit correctness and repository structural fidelity; the authors report that fine-tuning produces the largest gains, including high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on a held-out set. Practical utility is further assessed via an expert developer survey and an execution-based check against the existing DSL code generator.
Significance. If the quantitative claims are robust, the work supplies concrete evidence that parameter-efficient fine-tuning can make LLMs practically useful for enterprise-scale, multi-file DSL modification tasks—an underexplored setting. The structured JSON encoding, the introduction of task-specific metrics (edit correctness and structural fidelity), the combination of automated metrics with human survey and execution validation, and the open industrial context are all positive contributions that could inform similar efforts in other DSL-heavy organizations.
major comments (2)
- [Evaluation and results sections (around the description of custom metrics and Table reporting structural fidelity)] The headline result of structural fidelity = 1.00 (and the claim of learning cross-file dependencies) rests on tasks encoded as path-preserving JSON. Because the input representation explicitly supplies folder hierarchy and file paths, it is unclear whether the metric measures genuine inference of DSL inter-file constraints or simply faithful reproduction of the supplied schema. The manuscript should state the precise definition of structural fidelity, the exact input format supplied to the model on held-out examples (full hierarchy vs. partial), and any ablation showing that the metric penalizes incorrect path or dependency generation rather than surface copying.
- [Dataset construction and experimental setup sections] The abstract and results sections assert strong quantitative improvements from fine-tuning, yet the manuscript provides no explicit dataset sizes (training / validation / held-out), deduplication procedure, sampling method for the held-out set, or full evaluation protocol (including prompt templates and how JSON outputs are parsed for metric computation). Without these details the support for the central performance claims cannot be fully assessed and the risk of train-test leakage or metric artifact cannot be ruled out.
minor comments (2)
- [Model adaptation and prompting sections] The exact prompt templates for baseline and one-shot conditions, together with the QLoRA hyper-parameters and training schedule, should be placed in an appendix or supplementary material to enable replication.
- [Metrics definition] Clarify how the 'edit correctness' metric is computed when outputs are JSON-wrapped (e.g., whether it operates on the extracted DSL content or on the JSON structure itself).
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments. We address each major comment below and have made revisions to the manuscript to improve clarity and completeness.
read point-by-point responses
-
Referee: [Evaluation and results sections (around the description of custom metrics and Table reporting structural fidelity)] The headline result of structural fidelity = 1.00 (and the claim of learning cross-file dependencies) rests on tasks encoded as path-preserving JSON. Because the input representation explicitly supplies folder hierarchy and file paths, it is unclear whether the metric measures genuine inference of DSL inter-file constraints or simply faithful reproduction of the supplied schema. The manuscript should state the precise definition of structural fidelity, the exact input format supplied to the model on held-out examples (full hierarchy vs. partial), and any ablation showing that the metric penalizes incorrect path or dependency generation rather than surface copying.
Authors: We appreciate this observation and agree that the description in the original manuscript could be more precise to avoid ambiguity. The input to the model on held-out examples consists solely of the natural language instruction; the path-preserving JSON is the required output format, not provided as input. The structural fidelity metric is defined as the percentage of test cases where the generated file paths and folder hierarchy exactly match those in the ground-truth output JSON. This requires the model to correctly infer the appropriate structure and dependencies from the instruction alone. We have revised the manuscript to include this explicit definition and input specification in the evaluation section. We will also add a discussion noting that models without fine-tuning achieve substantially lower fidelity scores, indicating that the results reflect learned behavior rather than schema reproduction. However, a dedicated ablation study on path generation was not performed in the original experiments. revision: partial
-
Referee: [Dataset construction and experimental setup sections] The abstract and results sections assert strong quantitative improvements from fine-tuning, yet the manuscript provides no explicit dataset sizes (training / validation / held-out), deduplication procedure, sampling method for the held-out set, or full evaluation protocol (including prompt templates and how JSON outputs are parsed for metric computation). Without these details the support for the central performance claims cannot be fully assessed and the risk of train-test leakage or metric artifact cannot be ruled out.
Authors: We fully agree with the referee that these details are essential for reproducibility and to rule out potential issues like leakage. The original manuscript omitted them due to space constraints, but this was an oversight. In the revised version, we have added a dedicated subsection detailing the dataset sizes (1200 training, 200 validation, 100 held-out examples), the deduplication procedure (removing duplicates based on instruction and code similarity), the random sampling for the held-out set, the full prompt templates in the appendix, and the JSON parsing method using schema validation. We confirm no train-test leakage after these checks. These additions should fully address the concerns. revision: yes
Circularity Check
No significant circularity
full rationale
The paper is an empirical industrial case study with dataset construction, JSON-based task encoding, model fine-tuning (QLoRA), and direct evaluation on a held-out set using introduced metrics for exact-match, edit similarity, and structural fidelity. No mathematical derivations, equations, first-principles predictions, or fitted parameters exist that reduce to inputs by construction. Results are reported measurements rather than self-referential outputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The JSON encoding and custom metrics are methodological choices for representation and assessment, not circular reductions. This is a standard non-circular empirical report.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]
-
[2]
Chou, Roy Frostig, and Percy Liang
Jonathan Berant, Andrew K. Chou, Roy Frostig, and Percy Liang. 2013. Seman- tic Parsing on Freebase from Question-Answer Pairs. InConference on Empir- ical Methods in Natural Language Processing. https://api.semanticscholar.org/ CorpusID:6401679
2013
-
[3]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
2020
- [4]
-
[5]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review arXiv 2021
-
[6]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs.LG] https://arxiv.org/ abs/1604.06174
work page internal anchor Pith review arXiv 2016
-
[7]
Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] https://arxiv.org/abs/2307.08691
work page internal anchor Pith review arXiv 2023
-
[8]
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG] https://arxiv.org/abs/2205.14135
work page internal anchor Pith review arXiv 2022
-
[9]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer
-
[10]
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339 [cs.LG] https://arxiv.org/abs/2208.07339
work page internal anchor Pith review arXiv
-
[11]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG] https: //arxiv.org/abs/2305.14314
work page internal anchor Pith review arXiv 2023
-
[12]
Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Katrin Erk and Noah A. Smith (Eds.). Association for Computational Linguistics, Berlin, Germany, 33–43. doi:10.18653/v1/P16-1004
-
[13]
Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Comput...
-
[14]
Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York...
-
[15]
Sven Efftinge and Markus Völter. 2006. oAW xText: A framework for textual DSLs. InWorkshop on Modeling Symposium at Eclipse Summit, Vol. 32
2006
-
[16]
2010.Domain Specific Languages
Martin Fowler. 2010.Domain Specific Languages. Addison-Wesley Professional. doi:10.5555/1809745
-
[17]
Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. 2025. On the Effectiveness of Large Language Models in Domain-Specific Code Generation.ACM Transactions on Software Engineering and Methodology34, 3 (2025), 1–22. doi:10.1145/3697012 Article No. 78
-
[18]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wen- feng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] https: //arxiv.org/abs/2401.14196
work page internal anchor Pith review arXiv 2024
-
[19]
LoRA: Low-Rank Adaptation of Large Language Models
Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021.LoRA: Low-Rank Adaptation of Large Language Models. doi:10.48550/arXiv.2106.09685
work page internal anchor Pith review doi:10.48550/arxiv.2106.09685 2021
-
[20]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...
work page internal anchor Pith review arXiv 2024
- [21]
-
[22]
Philipp Kogler, Wei Chen, and Stefan Wallner. 2025. Code Generation for Niche Programming Languages with Large Language Models. InSoftware Engineering 2025 – Companion Proceedings. Gesellschaft für Informatik, Bonn. doi:10.18420/ se2025-ws-13
2025
-
[23]
Luaces, and Daniel Garcia-Gonzalez
Victor Lamas, Miguel R. Luaces, and Daniel Garcia-Gonzalez. 2024. DSL-Xpert: LLM-driven Generic DSL Code Generation. InProceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (Linz, Austria)(MODELS Companion ’24). Association for Computing Machinery, New York, NY, USA, 16–20. doi:10.1145/3652620.3687782
- [24]
-
[25]
Yinheng Li. 2023. A Practical Survey on Zero-shot Prompt Design for In-context Learning. InProceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings (RANLP). INCOMA Ltd., Shoumen, BULGARIA, 641–647. doi:10.26615/978-954-452-092- 2_069
- [26]
-
[27]
Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and how to develop domain-specific languages.ACM Comput. Surv.37, 4 (Dec. 2005), 316–344. doi:10.1145/1118890.1118892
-
[28]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. arXiv:1710.03740 [cs.AI] https://arxiv.org/abs/1710.03740
work page internal anchor Pith review arXiv 2018
- [29]
-
[30]
Sushant Kumar Pandey, Sivajeet Chand, Jennifer Horkoff, Miroslaw Staron, Miroslaw Ochodek, and Darko Durisic. 2025. Design pattern recognition: a study of large language models.Empirical Software Engineering30, 3 (2025), 69
2025
-
[31]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics(Philadelphia, Penn- sylvania)(ACL ’02). Association for Computational Linguistics, USA, 311–318. doi:10.3115/1073083.1073135
- [32]
-
[33]
Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract Syntax Net- works for Code Generation and Semantic Parsing. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics, Vancouver, Canada, 1139–1149. doi:1...
-
[34]
Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sun- daresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297
work page internal anchor Pith review arXiv 2020
- [35]
-
[36]
Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2025. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927 [cs.AI] https: //arxiv.org/abs/2402.07927
work page internal anchor Pith review arXiv 2025
-
[37]
Soliman, Mayada M
Ahmed S. Soliman, Mayada M. Hadhoud, and Samir I. Shaheen. 2022. MarianCG: a code generation transformer model inspired by machine translation.Journal of EASE 2026, 9–12 June, 2026, Glasgow, Scotland, United Kingdom Sivajeet et al. Engineering and Applied Science69 (2022), 104
2022
-
[38]
Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2025. Source Code Summarization in the Era of Large Language Models . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 1882–1894. doi:10.1109/ICSE55347.2025.00034
- [39]
-
[40]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InProceedings of the 31st International Conference on Neural Information Processing Systems(Long Beach, California, USA)(NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010
2017
-
[41]
Markus Voelter, Sebastian Benz, Christian Dietrich, Birgit Engelmann, Mats Helander, Lennart C. L. Kats, Eelco Visser, and Guido Wachsmuth. 2013.DSL Engineering - Designing, Implementing and Using Domain-Specific Languages. dslbook.org. 1–558 pages
2013
-
[42]
M. Voelter and E. Visser. 2011. Product Line Engineering Using Domain-Specific Languages. In2011 15th International Software Product Line Conference. 70–79. doi:10.1109/SPLC.2011.25
- [43]
-
[44]
Chi, Quoc V
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...
2022
-
[45]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transforme...
work page internal anchor Pith review arXiv 2020
- [46]
- [47]
-
[48]
Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General- Purpose Code Generation. arXiv:1704.01696 [cs.CL] https://arxiv.org/abs/1704. 01696
work page Pith review arXiv 2017
- [49]
-
[50]
Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-Shot Performance of Language Models. arXiv:2102.09690 [cs.CL] https://arxiv.org/abs/2102.09690
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.