arxiv: 2601.23206 · v2 · pith:GKJBZ4Y7new · submitted 2026-01-30 · 💻 cs.AI

High-quality generation of dynamic game content via small language models: A proof of concept

Pith reviewed 2026-05-21 14:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords small language models · dynamic content generation · game AI · fine-tuning · procedural content · narrative generation · real-time systems · quality evaluation

The pith

Small language models generate high-quality dynamic game content through narrow fine-tuning and retries

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to show that small language models can create coherent, dynamic game content if they are fine-tuned aggressively on deliberately narrow tasks with limited context, constrained output structure, or both. A reader would care because this could let games generate personalized stories and interactions locally on the player's device, avoiding the expense and delays of large cloud-based models. The authors generate synthetic training data through a directed acyclic graph (DAG), which keeps the model tied to the game's world and lore. They test this in a minimal role-playing game where characters fight rhetorical battles over reputation, using a loop that keeps generating until the result passes an automatic check. The authors report that this makes the output reliable enough and fast enough to use during actual gameplay.

Core claim

By training small language models on deliberately narrow game tasks with data synthesized from a directed acyclic graph (DAG) of the game world, the authors create a model that powers reputation battles in a minimal role-playing scenario. A retry-until-success approach then filters outputs to meet quality standards set by an automated LLM-as-a-judge scheme, resulting in performance that fits within the timing requirements of real-time game engines.
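
A minimal sketch of the retry-until-success filter described above, assuming the generator and the judge are plain callables. The names `retry_until_success` and `stub_judge`, the score threshold, and the `max_attempts` cap are illustrative assumptions, not the authors' implementation; the paper's actual models and acceptance rule may differ.

```python
from typing import Callable, Optional, Tuple


def retry_until_success(
    generate: Callable[[], str],
    judge: Callable[[str], float],
    threshold: float,
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Call generate() until judge() scores an output at or above threshold.

    Returns (accepted_text, attempts_used). accepted_text is None when the
    attempt budget runs out, so the caller can fall back to canned content.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if judge(candidate) >= threshold:
            return candidate, attempt
    return None, max_attempts


def stub_judge(text: str) -> float:
    """Hypothetical stand-in for the judge model: rewards one keyword."""
    return 0.9 if "lore-consistent" in text else 0.2


if __name__ == "__main__":
    # Hypothetical stand-in for the fine-tuned SLM: fails twice, then passes.
    canned = iter(["weak insult", "weak insult", "a lore-consistent accusation"])
    text, attempts = retry_until_success(lambda: next(canned), stub_judge, 0.5)
    print(text, attempts)
```

Capping the number of attempts bounds the worst-case latency, and returning `None` on exhaustion leaves the fallback decision to the game loop rather than blocking it.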

What carries the argument

The combination of scope-limited fine-tuning on DAG-generated synthetic data and a retry-until-success output filter

If this is right

  • These specialized models can be combined into networks that handle broader narrative structures in games
  • Dynamic content becomes feasible in offline or resource-limited game environments
  • Latency remains predictable, supporting seamless integration into game loops
  • The method reduces reliance on large models that often produce incoherent narratives
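
The predictable-latency point can be made concrete under a simplifying assumption that is not taken from the paper: each generation attempt passes the judge independently with a fixed probability p. The number of attempts N until the first pass is then geometrically distributed, and δ below denotes an acceptable failure probability chosen by the developer.

```latex
% Assumption (not from the paper): attempts pass independently with probability p.
\begin{align}
P(N = k)   &= (1 - p)^{k-1}\, p, \qquad k = 1, 2, \dots \\
\mathbb{E}[N] &= \frac{1}{p} \\
P(N \le k) &= 1 - (1 - p)^{k} \\
k_{\delta} &= \left\lceil \frac{\ln \delta}{\ln (1 - p)} \right\rceil
\end{align}
```

Here k_δ is the smallest attempt budget that succeeds with probability at least 1 − δ. As a purely illustrative calculation, p = 0.5 and δ = 0.01 give k_δ = ⌈6.64⌉ = 7 attempts. If one attempt takes time t on average, the expected total generation time is t / p.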

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could apply to generating other interactive elements, such as quest descriptions or character backstories in real time
  • Further work might explore whether the same narrow-scoping idea works for non-text content like level design suggestions
  • A key open question is how well the automated quality checks match what actual players enjoy during play
  • Developers might build tools that automatically generate the training graphs from existing game design documents

Load-bearing premise

The automated judge using another language model correctly identifies outputs that humans would consider high quality and coherent for game use

What would settle it

Have people play the generated reputation battles and rate them on coherence and engagement, then compare those ratings to the judge model's scores to see if they agree
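
One way to run the comparison described above is a rank correlation between human ratings and judge scores for the same outputs. The sketch below implements Spearman's rank correlation in plain Python; the function names and the example numbers are hypothetical, and the choice of Spearman over other agreement measures is an assumption rather than something the paper prescribes.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    human_ratings = [4.0, 2.0, 5.0, 3.0]  # hypothetical player ratings
    judge_scores = [0.7, 0.3, 0.9, 0.6]   # hypothetical judge scores
    print(spearman(human_ratings, judge_scores))
```

A value near 1 would suggest the judge orders outputs the way players do; a value near 0 would undercut the load-bearing premise.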

Figures

Figures reproduced from arXiv: 2601.23206 by Arturo Valdivia, Morten I. K. Munk, Paolo Burelli.

Figure 3. Each component of z⃗ is either selected from an appropriate predefined list (choice nodes) or generated using any elements up until the given point in the DAG execution (generation nodes). The lists may be nested, such that origin country may affect which social classes a character can belong to, for example. Choices at any given point in the DAG flow may affect which lists are available downstream. Variat…
Figure 2. Input/output structure for DefameLM with example outputs from ChatGPT-4o (training data gold standard), DefameLM…
Figure 3. Example of the DAG-based data generation using…
Figure 4. Training data generation and fine-tuning pipeline.
Figure 5. Mean LLM-as-judge scores per metric for DefameLM…
Figure 6. Figure 6a shows the empirical cumulative distribution function…
Figure 6. Distribution of generation efficiency across quanti…
Original abstract

Large language models (LLMs) offer promise for dynamic game content generation, but they face critical barriers, including narrative incoherence and high operational costs. Due to their large size, they are often accessed in the cloud, limiting their application in offline games. Many of these practical issues are solved by pivoting to small language models (SLMs), but existing studies using SLMs have resulted in poor output quality. We propose a strategy of achieving high-quality SLM generation through aggressive fine-tuning on deliberately scoped tasks with narrow context, constrained structure, or both. In short, more difficult tasks require narrower scope and higher specialization to the training corpus. Training data is synthetically generated via a DAG-based approach, grounding models in the specific game world. Such models can form the basis for agentic networks designed around the narratological framework at hand, representing a more practical and robust solution than cloud-dependent LLMs. To validate this approach, we present a proof-of-concept focusing on a single specialized SLM as the fundamental building block. We introduce a minimal RPG loop revolving around rhetorical battles of reputations, powered by this model. We demonstrate that a simple retry-until-success strategy reaches adequate quality (as defined by an LLM-as-a-judge scheme) with predictable latency suitable for real-time generation. While local quality assessment remains an open question, our results demonstrate feasibility for real-time generation under typical game engine constraints.
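
The DAG-based data synthesis mentioned in the abstract and in the Figure 3 caption (choice nodes that pick from possibly nested lists, generation nodes that use everything chosen upstream) can be sketched as follows. All names and world data here are hypothetical, and `generate_bio` is a plain string template standing in for whatever model call the authors use at generation nodes.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical world data: available social classes depend on origin country.
COUNTRIES = ["Northmark", "Sunreach"]
CLASSES_BY_COUNTRY = {
    "Northmark": ["thane", "fisher"],
    "Sunreach": ["merchant", "scribe"],
}

Context = Dict[str, str]
Node = Callable[[Context, random.Random], str]


def choose_country(ctx: Context, rng: random.Random) -> str:
    """Choice node: pick from a flat predefined list."""
    return rng.choice(COUNTRIES)


def choose_class(ctx: Context, rng: random.Random) -> str:
    """Choice node: the available list is nested under an upstream choice."""
    return rng.choice(CLASSES_BY_COUNTRY[ctx["country"]])


def generate_bio(ctx: Context, rng: random.Random) -> str:
    """Generation node: a template stands in for a model call (assumption)."""
    return f"A {ctx['social_class']} from {ctx['country']}."


# Nodes as (name, parent names, function), listed in topological order.
DAG: List[Tuple[str, List[str], Node]] = [
    ("country", [], choose_country),
    ("social_class", ["country"], choose_class),
    ("bio", ["country", "social_class"], generate_bio),
]


def sample_record(seed: int) -> Context:
    """Execute the DAG once and return one synthetic record."""
    rng = random.Random(seed)
    ctx: Context = {}
    for name, parents, node in DAG:
        missing = [p for p in parents if p not in ctx]
        if missing:
            raise ValueError(f"node {name!r} ran before parents {missing}")
        ctx[name] = node(ctx, rng)
    return ctx


if __name__ == "__main__":
    print(sample_record(seed=0))
```

Seeding each record makes the synthetic corpus reproducible, and the parent check catches ordering mistakes when new nodes are added.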

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that small language models (SLMs), when aggressively fine-tuned on narrowly scoped tasks with synthetic DAG-generated training data grounded in a specific game world, can produce high-quality dynamic game content. It presents a proof-of-concept for a minimal RPG rhetorical-battle loop and shows that a retry-until-success strategy achieves adequate quality (per an LLM-as-judge metric) with predictable latency suitable for real-time generation under typical game-engine constraints.

Significance. If the central feasibility result holds, the work would demonstrate a practical, offline-capable route to dynamic narrative content that avoids the cost and connectivity requirements of large cloud LLMs, while opening a path toward composable agentic networks built on narratological frameworks.

major comments (2)
  1. [Abstract] The central claim that the retry-until-success strategy 'reaches adequate quality ... suitable for real-time generation' rests entirely on an LLM-as-a-judge scheme whose relationship to the generator SLM (shared pre-training data, architecture family, or fine-tuning distribution) is unspecified; no human baseline, inter-annotator agreement, or correlation study is reported, leaving the support for both 'high-quality' and real-time suitability thin.
  2. [Abstract] The manuscript acknowledges that 'local quality assessment remains an open question' yet draws conclusions about real-time suitability and narrative coherence from the unvalidated proxy; this circularity risk is load-bearing for the feasibility demonstration.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an explicit statement of the precise scope of the single specialized SLM (e.g., input/output formats, context window, fine-tuning corpus size) to allow readers to assess how 'narrow' the scoping actually is.
  2. [Abstract] No error bars, latency distributions, or retry-count statistics are mentioned in the provided abstract; adding these would strengthen the 'predictable latency' claim even under the current evaluation protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our proof-of-concept manuscript. We address the two major comments point by point below, proposing targeted revisions to improve clarity around our evaluation approach while preserving the scoped nature of the work.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the retry-until-success strategy 'reaches adequate quality ... suitable for real-time generation' rests entirely on an LLM-as-a-judge scheme whose relationship to the generator SLM (shared pre-training data, architecture family, or fine-tuning distribution) is unspecified; no human baseline, inter-annotator agreement, or correlation study is reported, leaving the support for both 'high-quality' and real-time suitability thin.

    Authors: We agree that the abstract would benefit from explicit details on the judge model. The judge is a separately fine-tuned SLM from the same architecture family, trained on a distinct synthetic assessment dataset with no overlapping examples from the generator's training distribution. Real-time suitability is supported by direct latency measurements collected during generation runs, which are independent of the quality scores. As this is a proof-of-concept, we did not include a human baseline or correlation study; we will revise the abstract and methods to specify the judge configuration, separate latency results from quality claims, and add a limitations paragraph noting the proxy nature of the metric and the value of future human validation. revision: yes

  2. Referee: [Abstract] The manuscript acknowledges that 'local quality assessment remains an open question' yet draws conclusions about real-time suitability and narrative coherence from the unvalidated proxy; this circularity risk is load-bearing for the feasibility demonstration.

    Authors: The explicit acknowledgment that local quality assessment remains open was meant to frame the work as preliminary. The feasibility claims center on the retry-until-success strategy producing outputs that pass the scoped judge within measured, predictable latencies compatible with game-engine constraints; narrative coherence is demonstrated only within the minimal rhetorical-battle loop. To reduce any appearance of circularity, we will revise the abstract to qualify conclusions more precisely, distinguish direct latency evidence from proxy-based quality, and expand the discussion to articulate the role and limitations of the automated judge. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical proof-of-concept demonstration

The paper is an empirical proof of concept for scoped fine-tuning of SLMs on DAG-generated synthetic data for a narrow RPG rhetorical-battle task, followed by a retry-until-success generation loop. The central demonstration is that this pipeline produces outputs judged adequate by an LLM-as-a-judge scheme with predictable latency. No equations, first-principles derivations, or fitted parameters are presented that reduce to their own inputs by construction. The authors explicitly state that local quality assessment remains an open question, so the judge proxy is not treated as a load-bearing self-definition. The approach does not rely on self-citations for uniqueness, ansatzes smuggled from prior work, or renaming of known results. The reported feasibility for real-time use under game-engine constraints is therefore self-contained and independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract supplies limited technical detail; the approach implicitly rests on the assumption that narrowing task scope and using game-world synthetic data will overcome SLM quality limitations, with no explicit free parameters or new entities named.

axioms (2)
  • domain assumption: Narrowly scoped tasks with constrained structure allow small models to reach adequate output quality after aggressive fine-tuning.
    Invoked in the proposal that 'more difficult tasks require narrower scope and higher specialization' to achieve high-quality generation.
  • domain assumption: DAG-based synthetic data generation produces training examples that sufficiently ground the model in the target game world.
    Stated as the method for creating training data that keeps outputs consistent with the specific game setting.

pith-pipeline@v0.9.0 · 5792 in / 1352 out tokens · 31510 ms · 2026-05-21T14:25:04.964745+00:00 · methodology

