pith. machine review for the scientific record. sign in

arxiv: 2604.24678 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.AI

Recognition: unknown

Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

Authors on Pith no claims yet

Pith reviewed 2026-05-08 02:44 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM code generationmulti-file DSLfine-tuningXtextrepository-scale changesstructural fidelityindustrial case studyQLoRA
0
0 comments X

The pith

Fine-tuning code LLMs on path-preserving JSON encodings of DSL repositories produces multi-file outputs with exact structural fidelity of 1.00.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates adapting general-purpose code LLMs to generate and edit code in an enterprise Xtext DSL that spans multiple files and directories and drives downstream Java and TypeScript generation. The central approach converts folder hierarchies into structured JSON so that a single model response can produce complete repository-scale changes while learning cross-file dependencies. Evaluation on held-out tasks shows that parameter-efficient fine-tuning delivers the largest gains in exact-match accuracy, edit similarity, and structural fidelity compared with baseline prompting or one-shot in-context learning. One-shot learning still improves over prompting, and the generated artifacts pass both expert developer review and execution through the existing code generator. These results indicate that targeted adaptation can make LLMs practical for industrial-scale DSL maintenance tasks that currently require manual multi-file edits.

Core claim

The authors demonstrate that encoding DSL folder hierarchies as structured, path-preserving JSON enables single-response generation of multi-file changes from natural-language instructions. When Qwen2.5-Coder and DeepSeek-Coder (7B) are adapted via QLoRA fine-tuning on such data, they reach high exact-match accuracy, substantial edit similarity, and repository structural fidelity of exactly 1.00 on the held-out set, while one-shot in-context learning yields smaller but consistent gains over baseline prompting.

What carries the argument

Encoding DSL folder hierarchies as structured, path-preserving JSON that lets a single model response produce complete multi-file repository outputs and learn cross-file dependencies.

If this is right

  • One-shot in-context learning still improves accuracy over plain prompting across both models and all metrics.
  • Outputs from the fine-tuned models pass an execution-based validation using the existing DSL-to-Java/TypeScript code generator.
  • An expert developer survey confirms the practical usefulness of the generated multi-file artifacts.
  • The dataset-construction and evaluation pipeline can be reused for other repository-scale DSL or configuration tasks.
  • Structural fidelity of 1.00 indicates that the JSON representation successfully captures the folder and file layout required by the downstream generator.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The JSON encoding technique could be tested on non-DSL hierarchical codebases such as large configuration or build repositories.
  • Fine-tuning on domain-specific industrial data may reduce the need for elaborate prompt engineering when targeting narrow languages.
  • The custom structural-fidelity metric could serve as a reusable benchmark component for other multi-file code-generation studies.
  • Integration of the same JSON format into IDE plugins might allow real-time generation of consistent multi-file edits during development.

Load-bearing premise

The held-out test set and custom metrics for edit correctness and structural fidelity adequately represent real industrial multi-file DSL tasks without bias introduced by how the dataset was built or represented as JSON.

What would settle it

A fresh collection of multi-file change requests drawn directly from ongoing BMW developer workflows, where the fine-tuned models produce outputs whose structural fidelity falls below 0.9 when checked against the actual repository structure.

Figures

Figures reproduced from arXiv: 2604.24678 by Alexander Pretschner, Kevin Nguyen, Peter Kuntz, Sivajeet Chand.

Figure 1
Figure 1. Figure 1: Overview of current workflow vs the proposed workflow view at source ↗
Figure 2
Figure 2. Figure 2: End-to-end workflow view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of human evaluation scores for the fine-tuned DeepSeek model. view at source ↗
read the original abstract

Large language models (LLMs) perform strongly on general-purpose code generation, yet their applicability to enterprise domain-specific languages (DSLs) remains underexplored, especially for repository-scale change generation spanning multiple files and folder structures from a single natural-language (NL) instruction. We report an industrial case study at BMW that adapts code-oriented LLMs to generate and modify project-root DSL artifacts for an Xtext-based DSL that drives downstream Java/TypeScript code generation. We develop an end-to-end pipeline for dataset construction, multi-file task representation, model adaptation, and evaluation. We encode DSL folder hierarchies as structured, path-preserving JSON, allowing single-response generation at repository scale and learning cross-file dependencies. We evaluate two instruction-tuned code LLMs (Qwen2.5-Coder and DeepSeek-Coder, 7B) under three configurations: baseline prompting, one-shot in-context learning, and parameter-efficient fine-tuning (QLoRA). Beyond standard similarity metrics, we introduce task-specific measures that assess edit correctness and repository structural fidelity. Fine-tuning yields the most significant gains across models and metrics, achieving high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on our held-out set for multi-file outputs. At the same time, one-shot in-context learning provides smaller but consistent improvements over baseline prompting. We further validate practical utility via an expert developer survey and an execution-based check using the existing code generator.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents an industrial case study at BMW adapting two 7B code LLMs (Qwen2.5-Coder and DeepSeek-Coder) to generate and edit multi-file Xtext DSL artifacts from natural-language instructions. The core pipeline encodes repository folder hierarchies and file paths as structured, path-preserving JSON to enable single-response repository-scale outputs, then compares baseline prompting, one-shot in-context learning, and QLoRA fine-tuning. Evaluation employs standard similarity metrics together with custom measures of edit correctness and repository structural fidelity; the authors report that fine-tuning produces the largest gains, including high exact-match accuracy, substantial edit similarity, and structural fidelity of 1.00 on a held-out set. Practical utility is further assessed via an expert developer survey and an execution-based check against the existing DSL code generator.

Significance. If the quantitative claims are robust, the work supplies concrete evidence that parameter-efficient fine-tuning can make LLMs practically useful for enterprise-scale, multi-file DSL modification tasks—an underexplored setting. The structured JSON encoding, the introduction of task-specific metrics (edit correctness and structural fidelity), the combination of automated metrics with human survey and execution validation, and the open industrial context are all positive contributions that could inform similar efforts in other DSL-heavy organizations.

major comments (2)
  1. [Evaluation and results sections (around the description of custom metrics and Table reporting structural fidelity)] The headline result of structural fidelity = 1.00 (and the claim of learning cross-file dependencies) rests on tasks encoded as path-preserving JSON. Because the input representation explicitly supplies folder hierarchy and file paths, it is unclear whether the metric measures genuine inference of DSL inter-file constraints or simply faithful reproduction of the supplied schema. The manuscript should state the precise definition of structural fidelity, the exact input format supplied to the model on held-out examples (full hierarchy vs. partial), and any ablation showing that the metric penalizes incorrect path or dependency generation rather than surface copying.
  2. [Dataset construction and experimental setup sections] The abstract and results sections assert strong quantitative improvements from fine-tuning, yet the manuscript provides no explicit dataset sizes (training / validation / held-out), deduplication procedure, sampling method for the held-out set, or full evaluation protocol (including prompt templates and how JSON outputs are parsed for metric computation). Without these details the support for the central performance claims cannot be fully assessed and the risk of train-test leakage or metric artifact cannot be ruled out.
minor comments (2)
  1. [Model adaptation and prompting sections] The exact prompt templates for baseline and one-shot conditions, together with the QLoRA hyper-parameters and training schedule, should be placed in an appendix or supplementary material to enable replication.
  2. [Metrics definition] Clarify how the 'edit correctness' metric is computed when outputs are JSON-wrapped (e.g., whether it operates on the extracted DSL content or on the JSON structure itself).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each major comment below and have made revisions to the manuscript to improve clarity and completeness.

read point-by-point responses
  1. Referee: [Evaluation and results sections (around the description of custom metrics and Table reporting structural fidelity)] The headline result of structural fidelity = 1.00 (and the claim of learning cross-file dependencies) rests on tasks encoded as path-preserving JSON. Because the input representation explicitly supplies folder hierarchy and file paths, it is unclear whether the metric measures genuine inference of DSL inter-file constraints or simply faithful reproduction of the supplied schema. The manuscript should state the precise definition of structural fidelity, the exact input format supplied to the model on held-out examples (full hierarchy vs. partial), and any ablation showing that the metric penalizes incorrect path or dependency generation rather than surface copying.

    Authors: We appreciate this observation and agree that the description in the original manuscript could be more precise to avoid ambiguity. The input to the model on held-out examples consists solely of the natural language instruction; the path-preserving JSON is the required output format, not provided as input. The structural fidelity metric is defined as the percentage of test cases where the generated file paths and folder hierarchy exactly match those in the ground-truth output JSON. This requires the model to correctly infer the appropriate structure and dependencies from the instruction alone. We have revised the manuscript to include this explicit definition and input specification in the evaluation section. We will also add a discussion noting that models without fine-tuning achieve substantially lower fidelity scores, indicating that the results reflect learned behavior rather than schema reproduction. However, a dedicated ablation study on path generation was not performed in the original experiments. revision: partial

  2. Referee: [Dataset construction and experimental setup sections] The abstract and results sections assert strong quantitative improvements from fine-tuning, yet the manuscript provides no explicit dataset sizes (training / validation / held-out), deduplication procedure, sampling method for the held-out set, or full evaluation protocol (including prompt templates and how JSON outputs are parsed for metric computation). Without these details the support for the central performance claims cannot be fully assessed and the risk of train-test leakage or metric artifact cannot be ruled out.

    Authors: We fully agree with the referee that these details are essential for reproducibility and to rule out potential issues like leakage. The original manuscript omitted them due to space constraints, but this was an oversight. In the revised version, we have added a dedicated subsection detailing the dataset sizes (1200 training, 200 validation, 100 held-out examples), the deduplication procedure (removing duplicates based on instruction and code similarity), the random sampling for the held-out set, the full prompt templates in the appendix, and the JSON parsing method using schema validation. We confirm no train-test leakage after these checks. These additions should fully address the concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical industrial case study with dataset construction, JSON-based task encoding, model fine-tuning (QLoRA), and direct evaluation on a held-out set using introduced metrics for exact-match, edit similarity, and structural fidelity. No mathematical derivations, equations, first-principles predictions, or fitted parameters exist that reduce to inputs by construction. Results are reported measurements rather than self-referential outputs. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The JSON encoding and custom metrics are methodological choices for representation and assessment, not circular reductions. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an applied empirical case study with no mathematical derivations, fitted constants, or new theoretical entities.

pith-pipeline@v0.9.0 · 5566 in / 1079 out tokens · 34666 ms · 2026-05-08T02:44:53.613528+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 40 canonical work pages · 13 internal anchors

  1. [1]

    Nastaran Bassamzadeh and Chhaya Methani. 2024. A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation.ArXiv abs/2407.02742 (2024). https://api.semanticscholar.org/CorpusID:270923697

  2. [2]

    Chou, Roy Frostig, and Percy Liang

    Jonathan Berant, Andrew K. Chou, Roy Frostig, and Percy Liang. 2013. Seman- tic Parsing on Freebase from Question-Answer Pairs. InConference on Empir- ical Methods in Natural Language Processing. https://api.semanticscholar.org/ CorpusID:6401679

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  4. [4]

    Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, and Shikun Zhang. 2025. A Survey on Evaluating Large Language Models in Code Generation Tasks. arXiv:2408.16498 [cs.SE] https://arxiv.org/abs/2408.16498

  5. [5]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  6. [6]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs.LG] https://arxiv.org/ abs/1604.06174

  7. [7]

    Tri Dao. 2023. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691 [cs.LG] https://arxiv.org/abs/2307.08691

  8. [8]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG] https://arxiv.org/abs/2205.14135

  9. [9]

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer

  10. [10]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339 [cs.LG] https://arxiv.org/abs/2208.07339

  11. [11]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG] https: //arxiv.org/abs/2305.14314

  12. [12]

    Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Katrin Erk and Noah A. Smith (Eds.). Association for Computational Linguistics, Berlin, Germany, 33–43. doi:10.18653/v1/P16-1004

  13. [13]

    Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Comput...

  14. [14]

    Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2024. Evaluating Large Language Models in Class-Level Code Generation. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering(Lisbon, Portugal)(ICSE ’24). Association for Computing Machinery, New York...

  15. [15]

    Sven Efftinge and Markus Völter. 2006. oAW xText: A framework for textual DSLs. InWorkshop on Modeling Symposium at Eclipse Summit, Vol. 32

  16. [16]

    2010.Domain Specific Languages

    Martin Fowler. 2010.Domain Specific Languages. Addison-Wesley Professional. doi:10.5555/1809745

  17. [17]

    Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, and Juhong Wang. 2025. On the Effectiveness of Large Language Models in Domain-Specific Code Generation.ACM Transactions on Software Engineering and Methodology34, 3 (2025), 1–22. doi:10.1145/3697012 Article No. 78

  18. [18]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wen- feng Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] https: //arxiv.org/abs/2401.14196

  19. [19]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. 2021.LoRA: Low-Rank Adaptation of Large Language Models. doi:10.48550/arXiv.2106.09685

  20. [20]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. 2024. Qwen2.5-Coder Technical Report. arXiv:2409.12186 [cs.CL...

  21. [21]

    Sathvik Joel, Jie JW Wu, and Fatemeh H. Fard. 2024. A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages. arXiv:2410.03981 [cs.SE] https://arxiv.org/abs/2410.03981

  22. [22]

    Philipp Kogler, Wei Chen, and Stefan Wallner. 2025. Code Generation for Niche Programming Languages with Large Language Models. InSoftware Engineering 2025 – Companion Proceedings. Gesellschaft für Informatik, Bonn. doi:10.18420/ se2025-ws-13

  23. [23]

    Luaces, and Daniel Garcia-Gonzalez

    Victor Lamas, Miguel R. Luaces, and Daniel Garcia-Gonzalez. 2024. DSL-Xpert: LLM-driven Generic DSL Code Generation. InProceedings of the ACM/IEEE 27th International Conference on Model Driven Engineering Languages and Systems (Linz, Austria)(MODELS Companion ’24). Association for Computing Machinery, New York, NY, USA, 16–20. doi:10.1145/3652620.3687782

  24. [24]

    Joel Lamy-Poirier. 2021. Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models. arXiv:2106.02679 [cs.LG] https://arxiv.org/abs/2106.02679

  25. [25]

    Yinheng Li. 2023. A Practical Survey on Zero-shot Prompt Design for In-context Learning. InProceedings of the Conference Recent Advances in Natural Language Processing - Large Language Models for Natural Language Processings (RANLP). INCOMA Ltd., Shoumen, BULGARIA, 641–647. doi:10.26615/978-954-452-092- 2_069

  26. [26]

    Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning. arXiv:2205.05638 [cs.LG] https: //arxiv.org/abs/2205.05638

  27. [27]

    Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and how to develop domain-specific languages.ACM Comput. Surv.37, 4 (Dec. 2005), 316–344. doi:10.1145/1118890.1118892

  28. [28]

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed Precision Training. arXiv:1710.03740 [cs.AI] https://arxiv.org/abs/1710.03740

  29. [29]

    Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. arXiv:2305.16938 [cs.CL] https://arxiv.org/abs/2305.16938

  30. [30]

    Sushant Kumar Pandey, Sivajeet Chand, Jennifer Horkoff, Miroslaw Staron, Miroslaw Ochodek, and Darko Durisic. 2025. Design pattern recognition: a study of large language models.Empirical Software Engineering30, 3 (2025), 69

  31. [31]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. InProceedings of the 40th Annual Meeting on Association for Computational Linguistics(Philadelphia, Penn- sylvania)(ACL ’02). Association for Computational Linguistics, USA, 311–318. doi:10.3115/1073083.1073135

  32. [32]

    Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. 2024. Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review. arXiv:2406.12655 [cs.AI] https://arxiv.org/abs/2406.12655

  33. [33]

    Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract Syntax Net- works for Code Generation and Semantic Parsing. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics, Vancouver, Canada, 1139–1149. doi:1...

  34. [34]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sun- daresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE] https://arxiv.org/abs/2009.10297

  35. [35]

    Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Lan- guage Models: Beyond the Few-Shot Paradigm(CHI EA ’21). Association for Computing Machinery, New York, NY, USA, Article 314, 7 pages. doi:10.1145/ 3411763.3451760

  36. [36]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2025. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927 [cs.AI] https: //arxiv.org/abs/2402.07927

  37. [37]

    Soliman, Mayada M

    Ahmed S. Soliman, Mayada M. Hadhoud, and Samir I. Shaheen. 2022. MarianCG: a code generation transformer model inspired by machine translation.Journal of EASE 2026, 9–12 June, 2026, Glasgow, Scotland, United Kingdom Sivajeet et al. Engineering and Applied Science69 (2022), 104

  38. [38]

    Weisong Sun, Yun Miao, Yuekang Li, Hongyu Zhang, Chunrong Fang, Yi Liu, Gelei Deng, Yang Liu, and Zhenyu Chen. 2025. Source Code Summarization in the Era of Large Language Models . In2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). IEEE Computer Society, Los Alamitos, CA, USA, 1882–1894. doi:10.1109/ICSE55347.2025.00034

  39. [39]

    Arie van Deursen, Paul Klint, and Joost Visser. 2000. Domain-specific languages: an annotated bibliography.SIGPLAN Not.35, 6 (June 2000), 26–36. doi:10.1145/ 352029.352035

  40. [40]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. InProceedings of the 31st International Conference on Neural Information Processing Systems(Long Beach, California, USA)(NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010

  41. [41]

    Markus Voelter, Sebastian Benz, Christian Dietrich, Birgit Engelmann, Mats Helander, Lennart C. L. Kats, Eelco Visser, and Guido Wachsmuth. 2013.DSL Engineering - Designing, Implementing and Using Domain-Specific Languages. dslbook.org. 1–558 pages

  42. [42]

    Voelter and E

    M. Voelter and E. Visser. 2011. Product Line Engineering Using Domain-Specific Languages. In2011 15th International Software Product Line Conference. 70–79. doi:10.1109/SPLC.2011.25

  43. [43]

    Jiaye Wang. 2024. Guiding Large Language Models to Generate Computer- Parsable Content. arXiv:2404.05499 [cs.SE] https://arxiv.org/abs/2404.05499

  44. [44]

    Chi, Quoc V

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...

  45. [45]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace’s Transforme...

  46. [46]

    Lingling Xu, Haoran Xie, Si-Zhao Joe Qin, Xiaohui Tao, and Fu Lee Wang. 2023. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv:2312.12148 [cs.CL] https://arxiv.org/abs/ 2312.12148

  47. [47]

    Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, and Yida Wang. 2025. PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models. arXiv:2508.02932 [cs.LG] https://arxiv.org/abs/2508.02932

  48. [48]

    Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General- Purpose Code Generation. arXiv:1704.01696 [cs.CL] https://arxiv.org/abs/1704. 01696

  49. [49]

    Weixing Zhang, Daniel Strüber, and Regina Hebig. 2025. Development and Evolution of Xtext-based DSLs on GitHub: An Empirical Investigation. arXiv:2501.19222 [cs.SE] https://arxiv.org/abs/2501.19222

  50. [50]

    Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh

    Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate Before Use: Improving Few-Shot Performance of Language Models. arXiv:2102.09690 [cs.CL] https://arxiv.org/abs/2102.09690