High-quality generation of dynamic game content via small language models: A proof of concept
Pith reviewed 2026-05-21 14:25 UTC · model grok-4.3
The pith
Small language models generate high-quality dynamic game content through narrow fine-tuning and retries
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training small language models on deliberately narrow game tasks with data synthesized from a directed graph of the game world, the authors create a model that powers reputation battles in a minimal role-playing scenario. A retry-until-success approach then filters outputs to meet quality standards set by an automated judge, resulting in performance that fits within the timing requirements of real-time game engines.
What carries the argument
The combination of scope-limited fine-tuning on DAG-generated synthetic data and a retry-until-success output filter
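The retry-until-success filter named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generate` and `judge` are hypothetical callables standing in for the fine-tuned SLM and the automated judge, and the `max_attempts` cap is an added assumption so the loop always terminates inside a frame budget.

```python
from typing import Callable, Optional, Tuple


def retry_until_success(
    generate: Callable[[], str],
    judge: Callable[[str], bool],
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Call generate() until judge() accepts an output or attempts run out.

    Returns (accepted_text_or_None, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if judge(candidate):
            return candidate, attempt
    # No candidate passed; the caller decides on a fallback (e.g. canned text).
    return None, max_attempts


def toy_judge(text: str) -> bool:
    # Toy acceptance rule (an assumption for this sketch): at least four words.
    return len(text.split()) >= 4


if __name__ == "__main__":
    # Hypothetical generator that replays a fixed sequence of outputs.
    outputs = iter(["", "asdf", "You cheated at the harvest fair!"])
    text, attempts = retry_until_success(lambda: next(outputs), toy_judge)
    print(attempts, text)
```

Returning the attempt count alongside the text makes it straightforward to log retry statistics, which is the quantity a latency analysis would need.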
If this is right
- These specialized models can be combined into networks that handle broader narrative structures in games
- Dynamic content becomes feasible in offline or resource-limited game environments
- Latency remains predictable, supporting seamless integration into game loops
- The method reduces reliance on large models that often produce incoherent narratives
Where Pith is reading between the lines
- This technique could apply to generating other interactive elements, such as quest descriptions or character backstories in real time
- Further work might explore whether the same narrow-scoping idea works for non-text content like level design suggestions
- A key open question is how well the automated quality checks match what actual players enjoy during play
- Developers might build tools that automatically generate the training graphs from existing game design documents
Load-bearing premise
The automated judge using another language model correctly identifies outputs that humans would consider high quality and coherent for game use
What would settle it
Have people play the generated reputation battles and rate them on coherence and engagement, then compare those ratings to the judge model's scores to see if they agree
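One way to run that comparison, assuming both the human ratings and the judge scores are reduced to pass/fail labels per generated battle, is Cohen's kappa, which corrects raw agreement for chance. The sketch below is illustrative only; the binarization and the function name are assumptions, and the paper does not report such a study.

```python
from typing import Sequence


def cohens_kappa(human: Sequence[bool], judge: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary raters."""
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(human)
    # Fraction of items on which the two raters agree.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected if each rater labelled independently at their own pass rate.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave the same constant label; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human_pass = [True, True, True, False]
    judge_pass = [True, True, False, False]
    print(cohens_kappa(human_pass, judge_pass))
```

A kappa near 1 would support the judge as a proxy for human judgment; a value near 0 would indicate agreement no better than chance.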
Original abstract
Large language models (LLMs) offer promise for dynamic game content generation, but they face critical barriers, including narrative incoherence and high operational costs. Due to their large size, they are often accessed in the cloud, limiting their application in offline games. Many of these practical issues are solved by pivoting to small language models (SLMs), but existing studies using SLMs have resulted in poor output quality. We propose a strategy of achieving high-quality SLM generation through aggressive fine-tuning on deliberately scoped tasks with narrow context, constrained structure, or both. In short, more difficult tasks require narrower scope and higher specialization to the training corpus. Training data is synthetically generated via a DAG-based approach, grounding models in the specific game world. Such models can form the basis for agentic networks designed around the narratological framework at hand, representing a more practical and robust solution than cloud-dependent LLMs. To validate this approach, we present a proof-of-concept focusing on a single specialized SLM as the fundamental building block. We introduce a minimal RPG loop revolving around rhetorical battles of reputations, powered by this model. We demonstrate that a simple retry-until-success strategy reaches adequate quality (as defined by an LLM-as-a-judge scheme) with predictable latency suitable for real-time generation. While local quality assessment remains an open question, our results demonstrate feasibility for real-time generation under typical game engine constraints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that small language models (SLMs), when aggressively fine-tuned on narrowly scoped tasks with synthetic DAG-generated training data grounded in a specific game world, can produce high-quality dynamic game content. It presents a proof-of-concept for a minimal RPG rhetorical-battle loop and shows that a retry-until-success strategy achieves adequate quality (per an LLM-as-judge metric) with predictable latency suitable for real-time generation under typical game-engine constraints.
Significance. If the central feasibility result holds, the work would demonstrate a practical, offline-capable route to dynamic narrative content that avoids the cost and connectivity requirements of large cloud LLMs, while opening a path toward composable agentic networks built on narratological frameworks.
major comments (2)
- [Abstract] The central claim that the retry-until-success strategy 'reaches adequate quality ... suitable for real-time generation' rests entirely on an LLM-as-a-judge scheme whose relationship to the generator SLM (shared pre-training data, architecture family, or fine-tuning distribution) is unspecified; no human baseline, inter-annotator agreement, or correlation study is reported, leaving the support for both 'high-quality' and real-time suitability thin.
- [Abstract] The manuscript acknowledges that 'local quality assessment remains an open question' yet draws conclusions about real-time suitability and narrative coherence from the unvalidated proxy; this circularity risk is load-bearing for the feasibility demonstration.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from an explicit statement of the precise scope of the single specialized SLM (e.g., input/output formats, context window, fine-tuning corpus size) to allow readers to assess how 'narrow' the scoping actually is.
- [Abstract] No error bars, latency distributions, or retry-count statistics are mentioned in the provided abstract; adding these would strengthen the 'predictable latency' claim even under the current evaluation protocol.
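The retry-count statistics requested above have a simple closed form under one idealization. The derivation below assumes each attempt passes the judge independently with a fixed probability $p$ and costs a fixed time $\tau$ (generation plus judging); both symbols are introduced here for illustration, and neither value is reported in the abstract.

```latex
% N = number of attempts until the first accepted output (geometric distribution)
\begin{align}
  P(N = k) &= (1 - p)^{k-1}\, p, \qquad k = 1, 2, \dots \\
  \mathbb{E}[N] &= \frac{1}{p}, \qquad \operatorname{Var}[N] = \frac{1 - p}{p^{2}} \\
  P(N > k) &= (1 - p)^{k} \\
  \mathbb{E}[T] &= \tau\, \mathbb{E}[N] = \frac{\tau}{p}
\end{align}
% Attempts needed so that the failure probability is at most \varepsilon:
\begin{equation}
  (1 - p)^{k} \le \varepsilon
  \quad\Longleftrightarrow\quad
  k \ge \frac{\ln \varepsilon}{\ln (1 - p)}
\end{equation}
% Illustrative numbers only: p = 0.5, \varepsilon = 0.01 gives k \ge 6.64, i.e. 7 attempts.
```

Under these assumptions the tail probability shrinks geometrically with the attempt budget, which is one sense in which latency could be called predictable; measured per-attempt times and pass rates would be needed to check whether the idealization holds.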
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our proof-of-concept manuscript. We address the two major comments point by point below, proposing targeted revisions to improve clarity around our evaluation approach while preserving the scoped nature of the work.
- Referee: [Abstract] The central claim that the retry-until-success strategy 'reaches adequate quality ... suitable for real-time generation' rests entirely on an LLM-as-a-judge scheme whose relationship to the generator SLM (shared pre-training data, architecture family, or fine-tuning distribution) is unspecified; no human baseline, inter-annotator agreement, or correlation study is reported, leaving the support for both 'high-quality' and real-time suitability thin.
Authors: We agree that the abstract would benefit from explicit details on the judge model. The judge is a separately fine-tuned SLM from the same architecture family, trained on a distinct synthetic assessment dataset with no overlapping examples from the generator's training distribution. Real-time suitability is supported by direct latency measurements collected during generation runs, which are independent of the quality scores. As this is a proof-of-concept, we did not include a human baseline or correlation study; we will revise the abstract and methods to specify the judge configuration, separate latency results from quality claims, and add a limitations paragraph noting the proxy nature of the metric and the value of future human validation. revision: yes
- Referee: [Abstract] The manuscript acknowledges that 'local quality assessment remains an open question' yet draws conclusions about real-time suitability and narrative coherence from the unvalidated proxy; this circularity risk is load-bearing for the feasibility demonstration.
Authors: The explicit acknowledgment that local quality assessment remains open was meant to frame the work as preliminary. The feasibility claims center on the retry-until-success strategy producing outputs that pass the scoped judge within measured, predictable latencies compatible with game-engine constraints; narrative coherence is demonstrated only within the minimal rhetorical-battle loop. To reduce any appearance of circularity, we will revise the abstract to qualify conclusions more precisely, distinguish direct latency evidence from proxy-based quality, and expand the discussion to articulate the role and limitations of the automated judge. revision: yes
Circularity Check
No significant circularity in empirical proof-of-concept demonstration
The paper is an empirical proof of concept for scoped fine-tuning of SLMs on DAG-generated synthetic data for a narrow RPG rhetorical-battle task, followed by a retry-until-success generation loop. The central demonstration is that this pipeline produces outputs judged adequate by an LLM-as-a-judge scheme with predictable latency. No equations, first-principles derivations, or fitted parameters are presented that reduce to their own inputs by construction. The authors explicitly state that local quality assessment remains an open question, so the judge proxy is not treated as a load-bearing self-definition. The approach does not rely on self-citations for uniqueness, ansatzes smuggled from prior work, or renaming of known results. The reported feasibility for real-time use under game-engine constraints is therefore self-contained and independent of the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Narrowly scoped tasks with constrained structure allow small models to reach adequate output quality after aggressive fine-tuning.
- Domain assumption: DAG-based synthetic data generation produces training examples that sufficiently ground the model in the target game world.
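To make the second assumption concrete, here is one possible reading of DAG-based synthetic data generation: world facts are nodes, edges record which facts depend on which, and each training example is sampled in topological order so that every field is consistent with its parents. The schema, names, and samplers below are entirely hypothetical; the abstract does not describe the authors' graph or sampling procedure.

```python
import random
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Hypothetical world schema: each node maps to the nodes it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "faction": [],
    "speaker": ["faction"],
    "rumor": ["speaker"],
    "prompt": ["faction", "speaker", "rumor"],
}

# Hypothetical world facts: which speakers belong to which faction.
SPEAKERS = {"Guild": ["Mara", "Tobin"], "Temple": ["Ilse"]}
DEEDS = ["skimmed the tithe", "broke an oath"]

# One sampler per node; each reads only the values of its parents from ctx.
SAMPLERS: Dict[str, Callable[[dict, random.Random], str]] = {
    "faction": lambda ctx, rng: rng.choice(sorted(SPEAKERS)),
    "speaker": lambda ctx, rng: rng.choice(SPEAKERS[ctx["faction"]]),
    "rumor": lambda ctx, rng: f"{ctx['speaker']} {rng.choice(DEEDS)}",
    "prompt": lambda ctx, rng: f"[{ctx['faction']}] Respond to the rumor: {ctx['rumor']}.",
}


def sample_example(rng: random.Random) -> dict:
    """Sample one training example, visiting parents before children."""
    ctx: dict = {}
    for node in TopologicalSorter(DEPENDS_ON).static_order():
        ctx[node] = SAMPLERS[node](ctx, rng)
    return ctx


if __name__ == "__main__":
    print(sample_example(random.Random(0))["prompt"])
```

Because each sampler sees only already-fixed parent values, a sampled example cannot pair a speaker with the wrong faction, which is the kind of grounding the assumption relies on.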